# Chapter 3 Linear regression models

## 3.1 Simple linear regression

A simple linear regression model describes the relation between a numeric Y-variable and one numeric X-variable with a linear equation. The general form is:

$$Y = \beta_0 + \beta_1X + \epsilon$$ with:

$$\beta_0$$ the intercept
$$\beta_1$$ the slope
$$\epsilon$$ the residual or error term; this term stands for the effect of other variables on Y and for random variation in the Y-variable

Applying this model on a data set means estimating the values of the model parameters, $$\beta_0$$ and $$\beta_1$$, based on the available data.

### 3.1.1 Assessing a simple linear regression model

To assess the usefulness of an estimated regression model we use the coefficient of determination, R squared, the significance of the relationship between the Y- and the X-variable and the MSE (Mean Squared Error) or the Root Mean Squared Error (RMSE).

### 3.1.2 R squared ($$R^2$$)

The coefficient of determination, $$R^2$$, equals 1 minus the quotient of the variation of the Y-values around the regression line and the total variation in the Y-values.

$$R^2 = 1-\frac{\sum(Y-\hat{Y})^2}{\sum(Y-\bar{Y})^2} = 1 - \frac{\sum e_i^2}{\sum(Y-\bar{Y})^2}$$

The numerator in the fraction in these formulas, $$\sum e_i^2$$, measures the variation in the Y-values which is not explained by the regression model. The denominator, $$\sum(Y-\bar{Y})^2$$, measures the total variation in the Y-values. So the fraction measures the proportion of unexplained variation in the Y-values; 1 minus this fraction measures the proportion of the variation in the Y-values which is explained by the regression model.

For instance, if $$R^2$$ = .65, 65% of the variation in the Y-values is explained by the model; in other words, 65% of the variation in the Y-values can be explained by the variation in the X-values.

### 3.1.3 Significance of the X-variable in the model

Another way to assess a regression model is testing the significance of the coefficient of the X-variable. This is done by testing H_A_: $$\beta_1 \neq 0$$. If no support is found in the data for this hypothesis, the X-variable is not a significant variable. Or in other words, the Y-variable doesn’t depend significantly on the X-variable. Software packages used to generate a linear regression model, will also report the p-value for this test. A p-value below .05 is commonly interpreted as a significant result.

### 3.1.4 Mean Squared Error and Root Mean Squared Error

The most used metric to assess a regression model is the MSE (Mean Squared Error) or the square root of the MSE, the RMSE (Root Mean Squared Error):
$$MSE = \frac{\sum e_i^2}{N}$$
$$RMSE = \sqrt{\frac{\sum e_i^2}{N}}$$
These metrics measure the average distance of the data points to the regression line, measured in the Y direction. It is a metric which measures the model error.

Remember that in general we are more interested in the out-of-sample performance than in the in-sample performance. In case of a (simple) linear regression model, it can be shown that the in-sample MSE is a good estimate for the out-of-sample MSE. The estimate can even be improved by using N-2 in the denominator in stead of N:
$$RMSE = \sqrt{\frac{\sum e_i^2}{N-2}}$$.

Of course the commonly used techniques to estimate the out of sample error can be used as wel, such as splitting the data set in a training and a test set, cross validation or bootstrapping. In the example in the next session the different techniques to estimate the out-of-sample performance of a linear rergession model are illustrated, using an example.

## 3.2 Example simple linear regression

As an example of applying a linear regression model, we use a dataset with a random sample of Dutch cars from three brands: AUDI, CITROEN and KIA. The data is sampled from the RDW website on August 20, 2020. Data from two databases on this website are joined into one data frame and some data cleaning has been performed.

Table 3.1
First 10 Observations in RDW Data Set

library(tidyverse)
library(flextable)
library(kableExtra)
rdw_sample <- read_csv("data/20200820rdw_sample.csv")
flextable(head(rdw_sample, n = 10)) %>% fontsize(size = 7, part = "all")
 LICENSE_PLATE BRAND COLOR ECONOMY_LABEL FUEL_DESCRIPTION CILINDER_CONTENT MASS CATALOG_PRICE SPECIAL_TAX SP599G AUDI BLACK A DIESEL 1,968 1,635 70,861 2,576 SG896T AUDI BLACK D DIESEL 2,967 1,835 94,885 9,480 50SNX1 KIA GRAY B PETROL 1,396 1,217 20,685 3,446 SX290V AUDI GRAY B PETROL 1,395 1,390 48,109 6,386 GZ086R CITROEN GRAY A PETROL 1,199 1,215 26,080 2,935 TP133J AUDI GRAY C PETROL 1,984 1,475 68,186 8,311 HX408N KIA BLACK D PETROL 1,396 1,153 21,490 4,839 67PNH9 AUDI GRAY B DIESEL 1,968 1,500 46,575 9,832 TN023P AUDI WHITE B PETROL 1,395 1,200 44,021 1,916 5SBP96 AUDI GRAY B PETROL 1,984 1,540 55,047 6,957

Figure 3.1
Summary of the RDW Dataset

summary(rdw_sample)
 LICENSE_PLATE         BRAND              COLOR           ECONOMY_LABEL
Length:8341        Length:8341        Length:8341        Length:8341
Class :character   Class :character   Class :character   Class :character
Mode  :character   Mode  :character   Mode  :character   Mode  :character

FUEL_DESCRIPTION   CILINDER_CONTENT      MASS      CATALOG_PRICE
Length:8341        Min.   : 998     Min.   : 765   Min.   : 2423
Class :character   1st Qu.:1197     1st Qu.: 989   1st Qu.:17485
Mode  :character   Median :1396     Median :1215   Median :25750
Mean   :1472     Mean   :1202   Mean   :29271
3rd Qu.:1598     3rd Qu.:1402   3rd Qu.:35980
Max.   :4163     Max.   :2420   Max.   :99983
SPECIAL_TAX
Min.   :    9
1st Qu.: 1573
Median : 3084
Mean   : 4102
3rd Qu.: 5662
Max.   :27622  

The aim is to generate a model that estimates the Catalogue Price (CATALOG_PRICE) given the value of the other variables as predictors, excepted the Special Tax variable (this variable cannot be a predictor for the catalogue price, because this special tax depends on this price).

The first model is a simple linear model with MASS as predictor for CATALOG_PRICE.

Figure 3.2
Output Simple Linear Regression Model in R

library(MOTE) #apa() function for apa representation of numbers

linmod_01 <- lm(CATALOG_PRICE ~ MASS, data = rdw_sample)
Rsquared <- apa(summary(linmod_01)$r.squared, 3, FALSE) rmse <- prettyNum(round(summary(linmod_01)$sigma, 0), big.mark = ",")
mse <- prettyNum(summary(linmod_01)$sigma^2, big.mark = ",") summary(linmod_01)  Call: lm(formula = CATALOG_PRICE ~ MASS, data = rdw_sample) Residuals: Min 1Q Median 3Q Max -45568 -5890 475 3837 48936 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -3.125e+04 4.434e+02 -70.47 <2e-16 *** MASS 5.033e+01 3.599e-01 139.85 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 8833 on 8339 degrees of freedom Multiple R-squared: 0.7011, Adjusted R-squared: 0.701 F-statistic: 1.956e+04 on 1 and 8339 DF, p-value: < 2.2e-16 In the output can be seen that $$R^2$$ = .701; in other words 70.1% of the variation in the catalog prices is explained by the model. MASS is a significant variable in the model, P-value < .001. The RMSE, in the output named as ‘Residual standard error’, equals 8,833. The MSE is just this number squared, MSE = 78,020,205. As said before, in case of a (simple) linear regression model, the so calculated MSE is a good estimate for the out-of-sample MSE. This is the case, because with a linear regression model, possible overfitting is not really an issue. However, for more complicated models, it is. To explain the concepts for estimating out-of-sample errors when using more complex models, in the next seections the MSE is calculated in more general ways. ### 3.2.1 Estimating out-of-sample error using a training and a test set The first method splits the data randomly in a training set and a test set. The model parameters are estimated using the training set. The out-of-sample error is estimated using the test set. set.seed(20200823) prop_train <- .50 train <- sample(1:nrow(rdw_sample), prop_train*nrow(rdw_sample)) rdw_train <- rdw_sample[train,] rdw_test <- rdw_sample[-train,] linmod_02 <- lm(CATALOG_PRICE ~ MASS, data = rdw_train) predict_train <- predict(linmod_02) MSE_train <- sum((predict_train - rdw_train$CATALOG_PRICE)^2)/(nrow(rdw_train) - 2)
RMSE_train <- prettyNum(sqrt(MSE_train),
digits=0,
big.mark = ",")
#alternative
#RMSE_train <- summary(linmod_02)$sigma predict_test <- predict(linmod_02, newdata = rdw_test) MSE_test <- mean((predict_test - rdw_test$CATALOG_PRICE)^2)
RMSE_test <- prettyNum(sqrt(MSE_test),
digits=0, big.mark = ",")

The within sample RMSE, i.e. the RMSE calculated based on the training data, equals 8,673. The out-of-sample error estimate equals 8,992.

### 3.2.2 Estimating out-of-sample error using cross validation

In this section 10-fold cross validation is used to estimate the out-of-sample RMSE.

#10-fold CV
library(caret)
library(tidyverse)
library(kableExtra)
set.seed(20200824)
rdw_sample$GROUP <- sample(c(1, rep(1:10, 834))) MSEs <- NULL for (t in 1:10) { trainset <- filter(rdw_sample, GROUP != t) testset <- filter(rdw_sample, GROUP == t) linmod <- lm(CATALOG_PRICE ~ MASS, data = trainset) predict_test <- predict(linmod, newdata = testset) mse <- mean((predict_test - testset$CATALOG_PRICE)^2)
MSEs[t] = mse
}
mse_estimate <- mean(MSEs)
rmse_estimate <- sqrt(mse_estimate)

This procedure leads to 10 estimates for the out-of-sample MSE: 72,139,107, 71,120,651, 78,852,650, 74,810,525, 71,190,694, 77,776,460, 77,709,740, 84,458,930, 84,033,745, 88,307,401. The average of these estimates, 78,039,990, is the CV estimate of the out-of-sample MSE. The RMSE estimate is 8,834.
The variation in the estimates can be used to make a statement about the accuracy of the CV estimate, or to calculate an interval estimate for the out-of-sample error. To improve such an interval estimate, the CV procedure can be repeated dozens or hundreds of times; if this is done this procedure is called Repeated Cross Validation. For instance, if the procedure is repeated 200 times, this leads to 2000 estimates for the out-of-sample MSE, which can be used to calculate an interval estimate for the out-of-sample estimate. To calculate a 95% confidence interval (CI) for the out-of-smaple MSE, the .025th and the .975th quantile of the 2000 estimates are calculated. The probability that the interval between these two values contains the “real” out-of-sample MSE equals 95%.

#200 times repeated 10-fold CV
set.seed(20200829)

MSEs <- NULL

for (i in 1:200) {
rdw_sample$GROUP <- sample(c(1, rep(1:10, 834))) for (t in 1:10) { trainset <- filter(rdw_sample, GROUP != t) testset <- filter(rdw_sample, GROUP == t) linmod <- lm(CATALOG_PRICE ~ MASS, data = trainset) predict_test <- predict(linmod, newdata = testset) mse <- mean((predict_test - testset$CATALOG_PRICE)^2)
MSEs <- c(MSEs, mse)
}}

RMSEs <- sqrt(MSEs)
mse_estimate <- mean(MSEs)
mse_estimate_025 <- quantile(MSEs, .025)
mse_estimate_975 <- quantile(MSEs, .975)
rmse_estimate <- mean(sqrt(MSEs))
rmse_estimate_025 <- quantile(sqrt(MSEs), .025)
rmse_estimate_975 <- quantile(sqrt(MSEs), .975)

rmse_plot <- tibble(RMSE = RMSEs) %>%
ggplot(aes(x=RMSE)) +
geom_histogram(binwidth = 50, fill = "royalblue") +
geom_vline(aes(xintercept = rmse_estimate_025), col = "grey") +
geom_vline(aes(xintercept = rmse_estimate_975), col = "grey") +
theme_minimal()

Figure 3.3
RMSE Estimates Based on 200 times Repeated 10-folds CV

rmse_plot

Note. The gray lines mark the 95% CI for the RMSE.

In this example 200 times Repeated 10-fold CV, leads to:
point estimate for the out-of-sample MSE: 78,060,204
point estimate for the out-of-sample RMSE: 8,830
95% interval estimate for out-of-sample MSE: <67,379,651; 89,119,641>
95% interval estimate for out-of-sample RMSE: <8,209; 9,440>.

### 3.2.3 Estimating out-of-sample-error using bootstrapping

Bootstrapping is a technique of resampling from the available data set. Hundreds or even thousands of samples from the data set are drawn wtih replacement and sample size the number of observartions in the data set. This means that observations can be drwan more than one time in a bootstrap sample.
Bootstrapping is based on just one assumption: the available data form a representative sample of the research population. Under this assumption the population can be assumed to consist of a couple of thousand times this sample. Drawing a sample from this assumed population, is the same as drawing a sample from the available sample data with replacement. Bootstrap samples have the same sample size as the originale available sample data set. Under the mentioned assumption, a bootstrap sample can be seen as a random sample from the population of interest. Drawing 200 bootstrap samples yields 200 possible samples from the research population. These bootstrap samples can be used to estimate unknown population parameters, e.g. an MSE in a simple linear regression problem.

#200 bootstrap samples
set.seed(20200829)
MSEs_bootstrap <- NULL
for (i in 1:200) {
sample <- sample_n(rdw_sample, size = nrow(rdw_sample),replace = TRUE)
linmod <- lm(CATALOG_PRICE~MASS, data = sample)
predict_bootstrap <- predict(linmod)
mse <- mean((predict_bootstrap - sample$CATALOG_PRICE)^2) MSEs_bootstrap <- c(MSEs_bootstrap, mse) } RMSEs_bootstrap <- sqrt(MSEs_bootstrap) rmse_bootstrap_plot <- tibble(RMSE = RMSEs_bootstrap) %>% ggplot(aes(x=RMSE)) + geom_histogram(binwidth = 50, fill = "royalblue") + geom_vline(aes(xintercept = quantile(RMSEs_bootstrap, .025)), col = "grey") + geom_vline(aes(xintercept = quantile(RMSEs_bootstrap, .975)), col = "grey") + theme_minimal() Figure 3.4 RMSE estimates based on 200 Bootstrap Samples rmse_bootstrap_plot Results bootstrap sampling Estimate out-of-sample MSE = 78,056,696 Estimate out-of-sample RMSE = 8,834 ## 3.3 Caret Package The caret package is a usefull R package to apply a Machine Learning (ML) model on a data set. Actually it is not one package, but a combination of dozens of packages which are developed for all kinds of different ML models. The central function to apply a ML model on a sample is the train() function. This section provides the code when using the caret package, for the examples in the former sections. ### 3.3.1 Estimating MSE using bootstrap The first example is the use of the bootstrap sampling method. The reason for starting with the bootstrap sample, is because this is the default method in the caret package. Figure 3.5 The Generated Linear Model with Caret, Default Settings library(caret) set.seed(20200829) linmod_caret_bootstrap <- train(CATALOG_PRICE~MASS, data = rdw_sample, method = "lm") linmod_caret_bootstrap Linear Regression 8341 samples 1 predictor No pre-processing Resampling: Bootstrapped (25 reps) Summary of sample sizes: 8341, 8341, 8341, 8341, 8341, 8341, ... Resampling results: RMSE Rsquared MAE 8815.187 0.7015241 6444.3 Tuning parameter 'intercept' was held constant at a value of TRUE Examine the output in Table 3.1 carefully to understand how the caret package generates a linear model. As default caret produces 25 bootstrap samples to estimate the out-of-sample performance. The reported out-of-sample error is the average of 25 estimated out-of-sample errors. The reported RMSE is the bootstrap RMSE, the square root of the average of 25 bootstrapped MSE’s. Figure 3.6 The Summary of the Linear Model summary(linmod_caret_bootstrap)  Call: lm(formula = .outcome ~ ., data = dat) Residuals: Min 1Q Median 3Q Max -45568 -5890 475 3837 48936 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -3.125e+04 4.434e+02 -70.47 <2e-16 *** MASS 5.033e+01 3.599e-01 139.85 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 8833 on 8339 degrees of freedom Multiple R-squared: 0.7011, Adjusted R-squared: 0.701 F-statistic: 1.956e+04 on 1 and 8339 DF, p-value: < 2.2e-16 Table 3.2 shows the model that is reported as the final model. This final model is based on all available data. This model will be used if predictions for new data are made with the bootstrap model. In the default setting 25 bootstrap samples are used to estimate the out-of-sample performance. With the trControl argument of the train() function, this number can be adjusted to another number. Figure 3.7 Caret Linear Model Based on 200 Bootstrap Samples library(caret) set.seed(20200829) linmod_caret_bootstrap_200 <- train(CATALOG_PRICE~MASS, data = rdw_sample, method = "lm", trControl = trainControl(number = 200)) linmod_caret_bootstrap_200 Linear Regression 8341 samples 1 predictor No pre-processing Resampling: Bootstrapped (200 reps) Summary of sample sizes: 8341, 8341, 8341, 8341, 8341, 8341, ... Resampling results: RMSE Rsquared MAE 8834.139 0.7017922 6455.909 Tuning parameter 'intercept' was held constant at a value of TRUE The summary of this model is based on all the observations in the training set and is the same as the model in Figure 3.2. Figure 3.8 Summary Caret Linear Model Based on 200 Bootstrap Samples summary(linmod_caret_bootstrap_200)  Call: lm(formula = .outcome ~ ., data = dat) Residuals: Min 1Q Median 3Q Max -45568 -5890 475 3837 48936 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -3.125e+04 4.434e+02 -70.47 <2e-16 *** MASS 5.033e+01 3.599e-01 139.85 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 8833 on 8339 degrees of freedom Multiple R-squared: 0.7011, Adjusted R-squared: 0.701 F-statistic: 1.956e+04 on 1 and 8339 DF, p-value: < 2.2e-16 ### 3.3.2 Estimating MSE using Repeated Cross Validation The trControl() argument in the caret::train() funciton can also be used to use another method for estimating the out-of-sample performance. For instance cross validation or repeated cross validation. Figure 3.8 Caret Linear Model Based on 200 Times Repeated 10-Fold CV set.seed(20200829) linmod_caret_repeatedcv <- train(CATALOG_PRICE~MASS, data = rdw_sample, method = "lm", trControl = trainControl(method="repeatedcv", number=10, repeats = 200)) linmod_caret_repeatedcv Linear Regression 8341 samples 1 predictor No pre-processing Resampling: Cross-Validated (10 fold, repeated 200 times) Summary of sample sizes: 7506, 7508, 7506, 7508, 7508, 7507, ... Resampling results: RMSE Rsquared MAE 8828.989 0.7015469 6460.448 Tuning parameter 'intercept' was held constant at a value of TRUE Figure 3.9 Summary Caret Linear Model Based on 200 Times Repeated 10-Fold CV summary(linmod_caret_repeatedcv)  Call: lm(formula = .outcome ~ ., data = dat) Residuals: Min 1Q Median 3Q Max -45568 -5890 475 3837 48936 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -3.125e+04 4.434e+02 -70.47 <2e-16 *** MASS 5.033e+01 3.599e-01 139.85 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 8833 on 8339 degrees of freedom Multiple R-squared: 0.7011, Adjusted R-squared: 0.701 F-statistic: 1.956e+04 on 1 and 8339 DF, p-value: < 2.2e-16 The output of the train() function is a list with lots of information. One of the components of the list is the ‘results’ component. Besides the estimated RMSE and R squared value. Figure 3.10 Results Caret Linear Model Based on 200 Times Repeated 10-Fold CV linmod_caret_repeatedcv$results
  intercept     RMSE  Rsquared      MAE RMSESD RsquaredSD    MAESD
1      TRUE 8828.989 0.7015469 6460.448 302.76 0.01411291 180.9893

If 200 times repeated 10-Fold CV is applied, 2000 linear models are generated and evaluated. The reported out-of-sample performance is the average of 2000 out-of-sample performance estimates. The reported standard deviation (RMSESD) is the standard deviation of these 2000 estimates.