# Chapter 6 R Lab 5 - 12/05/2023

In this lecture we will continue the case study introduced in Section 5. **You need to run all the code from Lab 4, which includes the data preparation starting from the MSFT data.**

The following packages are required:

```
library(tidyverse)
library(lubridate) #for dealing with dates
library(zoo) #for time series manipulation and modeling
library(rpart) #regression/classification tree
library(rpart.plot) #tree plots
library(randomForest) #bagging and random forest
library(FNN) #for regression KNN
```

At the end you should have the following two datasets:

`glimpse(MSFTtr)`

```
## Rows: 1,160
## Columns: 8
## $ Date <date> 2017-05-25, 2017-05-26, 2017-05-30, 2017-05-31, 2017-06-01, 2017-06…
## $ MA2 <dbl> 0.0005927972, 0.0021850423, 0.0042008004, 0.0055389817, 0.0053356564…
## $ MA7 <dbl> 0.0000000000, 0.0001024794, 0.0013126507, 0.0026301730, 0.0036793010…
## $ MA14 <dbl> 0.0000000000, 0.0002496531, 0.0005977847, 0.0010322950, 0.0012602073…
## $ MA21 <dbl> 0.0000000000, 0.0003594390, 0.0007027827, 0.0010892422, 0.0012255786…
## $ SD7 <dbl> 0.05296239, 0.05802160, 0.06711377, 0.07466795, 0.05686829, 0.048740…
## $ lag1volume <dbl> 0.06904779, 0.13898080, 0.11946367, 0.09292550, 0.22164877, 0.136567…
## $ target <dbl> 0.006036340, 0.007553825, 0.005631552, 0.006508324, 0.012106306, 0.0…
```

`glimpse(MSFTte)`

```
## Rows: 77
## Columns: 8
## $ Date <date> 2022-01-03, 2022-01-04, 2022-01-05, 2022-01-06, 2022-01-07, 2022-01…
## $ MA2 <dbl> 1.0000000, 0.9618317, 0.9007791, 0.7473543, 0.6209897, 0.6014462, 0.…
## $ MA7 <dbl> 0.9960674, 1.0000000, 0.9855889, 0.9194452, 0.8500031, 0.7791910, 0.…
## $ MA14 <dbl> 1.0000000, 0.9925227, 0.9936001, 0.9642216, 0.9465012, 0.9308069, 0.…
## $ MA21 <dbl> 0.9876331, 0.9934102, 1.0000000, 0.9892257, 0.9661175, 0.9431301, 0.…
## $ SD7 <dbl> 0.09785490, 0.06212817, 0.21994156, 0.66243705, 0.88719009, 0.897984…
## $ lag1volume <dbl> 0.0000000, 0.1500012, 0.2025940, 0.3044882, 0.2988522, 0.2032250, 0.…
## $ target <dbl> 1.0000000, 0.7678880, 0.7219434, 0.7248838, 0.7291106, 0.7421593, 0.…
```

Let’s remove the `Date` column from the training dataset because it won’t be used for fitting the model:

`MSFTtr = MSFTtr %>% dplyr::select(-Date)`

### 6.0.1 Regression tree

We now have the data to estimate the regression tree for predicting the MSFT price, given the 6 selected regressors.

We now use the training data to estimate a regression tree by means of the `rpart` function of the `rpart` package (see `?rpart`). As usual we have to specify the formula, the dataset and the method (which will be set to `anova` for a regression tree). More information about the anova methodology can be found in this document (page 32).

```
singletree = rpart(formula = target ~ .,
                   data = MSFTtr,
                   method = "anova")
```

Note that some options that control the `rpart` algorithm are specified by means of the `rpart.control` function (see `?rpart.control`). In particular, the `cp` option, set by default to 0.01, represents a pre-pruning step because it prevents the generation of non-significant branches (any split that does not decrease the overall lack of fit by a factor of `cp` is not attempted). See also the meaning of the `minsplit`, `minbucket` and `maxdepth` options.
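As an illustration, these controls can be passed to `rpart` through `rpart.control`. The sketch below uses a toy dataset and arbitrary control values, not the settings of this lab:

```r
library(rpart)

# Toy data, used only to illustrate the control options
set.seed(1)
toy = data.frame(x1 = runif(200), x2 = runif(200))
toy$y = 2 * toy$x1 + rnorm(200, sd = 0.1)

ctrl = rpart.control(cp = 0.05,       # larger cp -> only strong splits are attempted
                     minsplit = 40,   # a node needs at least 40 obs to be split
                     minbucket = 20,  # every terminal node must contain at least 20 obs
                     maxdepth = 3)    # at most 3 levels below the root
smalltree = rpart(y ~ ., data = toy, method = "anova", control = ctrl)
```

Tighter values of these options produce a smaller tree than the defaults would.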

It is possible to plot the estimated tree as follows:

`rpart.plot(singletree)`

The function `rpart.plot` has different options controlling which information is displayed. Here we use it with its default settings: each node reports the average price and the percentage of training observations falling in that node (with respect to the total number of training observations). Note that basically all the numbers in the plot are rounded.
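A few of these display options can be tweaked as in the sketch below, which uses a toy tree rather than the lab's `singletree` (see `?rpart.plot` for the full list of arguments):

```r
library(rpart)
library(rpart.plot)

# Toy tree, used only to illustrate the plotting options
set.seed(1)
toy = data.frame(x = runif(200))
toy$y = 2 * toy$x + rnorm(200, sd = 0.1)
toytree = rpart(y ~ x, data = toy, method = "anova")

rpart.plot(toytree,
           type = 2,              # draw the split labels below the node boxes
           digits = 4,            # show more decimal places than the default (2)
           fallen.leaves = TRUE)  # align all terminal nodes at the bottom
```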

It is also possible to analyze the structure of the tree by printing it:

`singletree`

```
## n= 1160
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1160 84.7423800 0.34201040
## 2) MA2< 0.4017356 729 7.8698120 0.16016720
## 4) MA14< 0.1809473 465 1.2662160 0.09315739
## 8) MA2< 0.0809836 183 0.1166070 0.03702299 *
## 9) MA2>=0.0809836 282 0.1987565 0.12958500 *
## 5) MA14>=0.1809473 264 0.8378773 0.27819590 *
## 3) MA2>=0.4017356 431 11.9938400 0.64958280
## 6) MA7< 0.7050037 296 1.9696930 0.55132710
## 12) MA21< 0.5648444 192 0.4525512 0.50105920 *
## 13) MA21>=0.5648444 104 0.1363078 0.64412940 *
## 7) MA7>=0.7050037 135 0.9008853 0.86501730 *
```

The most important regressor, which is located in the root node, is `MA2` (so basically information referring to very few past observations).

Given the grown tree we are now able to compute the predictions for the test set and to compute, as a measure of performance, the mean squared error (\(MSE = \frac{1}{n}\sum_{i=1}^n(y_i-\hat y_i)^2\), with \(i\) denoting the generic observation in the test set and \(n\) the test set size). The predictions can be obtained by using the `predict` function, which will return the vector of predicted prices:

```
predsingletree = predict(object = singletree,
                         newdata = MSFTte)
# MSE
MSE1tree = mean((MSFTte$target - predsingletree)^2)
MSE1tree
```

`## [1] 0.03220142`

We also plot the observed test data and the corresponding predictions:

```
# Define a data.frame first, before using ggplot
data.frame(MSFTte, predsingletree) %>%
  ggplot() +
  geom_line(aes(Date, target, col = "obs")) +
  geom_line(aes(Date, predsingletree, col = "pred1tree"))
```

We see that the regression tree captures the trend of the test set, but its predictions are not accurate (with many periods of constant prediction).

## 6.1 Pruning

After the estimation of the decision tree it is possible to remove non-significant branches (pruning) by adopting the **cost-complexity approach** (i.e. by penalizing for the tree size). To determine the optimal cost-complexity (CP) parameter value \(\alpha\), `rpart` automatically performs a 10-fold cross-validation (see the argument `xval = 10` in the `rpart.control` function). The complexity table is part of the standard output of the `rpart` function and can be obtained as follows:

`singletree$cptable`

```
## CP nsplit rel error xerror xstd
## 1 0.76559954 0 1.00000000 1.00166420 0.035166329
## 2 0.10765880 1 0.23440046 0.23765846 0.009102505
## 3 0.06803819 2 0.12674166 0.12216520 0.004238147
## 4 0.01629449 3 0.05870347 0.05989170 0.002110345
## 5 0.01122051 4 0.04240898 0.04523947 0.001512709
## 6 0.01000000 5 0.03118847 0.03108513 0.001423453
```

Note that the number of splits is reported, rather than the number of nodes (however, remember that the number of **terminal nodes** is always given by 1 + the number of splits). The table reports, for different values of \(\alpha\) (`CP`), the relative training error (`rel error`) and the cross-validation error (`xerror`) together with its standard error (`xstd`). Note that the `rel error` column is scaled so that the first value is 1.

The usual approach is to select the tree with the lowest cross-validation error (`xerror`) and to find the corresponding value of `CP` and number of splits. In this case we observe the minimum `xerror` for `nsplit` equal to 5, which means 6 regions. This is the full tree estimated before, so in this case pruning is not necessary.
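This selection can also be automated: extract the row of `cptable` with the smallest `xerror` and pass the corresponding `CP` value to the `prune` function. The sketch below uses a toy tree standing in for the lab's `singletree`:

```r
library(rpart)

# Toy tree standing in for singletree
set.seed(1)
toy = data.frame(x = runif(300))
toy$y = sin(6 * toy$x) + rnorm(300, sd = 0.2)
toytree = rpart(y ~ x, data = toy, method = "anova")

cpt = toytree$cptable
bestcp = cpt[which.min(cpt[, "xerror"]), "CP"]  # CP with the lowest CV error
prunedtree = prune(toytree, cp = bestcp)        # pruned (possibly identical) tree
```

When the minimum `xerror` already corresponds to the full tree, as in this lab, `prune` simply returns the original tree.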

## 6.2 Bagging

Bagging corresponds to a particular case of random forest where all the regressors are used. For this reason, to implement bagging we use the function `randomForest` in the `randomForest` package. As the procedure is based on bootstrap resampling we need to set the seed. As usual we specify the formula and the data, together with the number of regressors to be used (`mtry`, in this case equal to the number of columns minus the column dedicated to the response variable). The option `importance` is set to `TRUE` when we are interested in assessing the importance of the predictors. Finally, `ntree` represents the number of independent trees to grow (500 is the default value).

As bagging is based on the bootstrap, which is a random procedure, it is convenient to set the seed before running the function:

```
set.seed(11) # the procedure is random (bootstrap)
bagmodel = randomForest(target ~ .,
                        data = MSFTtr,
                        mtry = ncol(MSFTtr) - 1,
                        ntree = 1000,
                        importance = TRUE)
```

Note that the information reported in the output (`Mean of squared residuals` and `% Var explained`) is computed using the out-of-bag (OOB) observations.
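To see this, note that calling `predict` on a `randomForest` object without supplying `newdata` returns the OOB predictions, so the reported `Mean of squared residuals` can be reproduced by hand. A sketch on toy data:

```r
library(randomForest)

# Toy data, used only to illustrate the OOB computation
set.seed(1)
toy = data.frame(x1 = runif(300), x2 = runif(300))
toy$y = 2 * toy$x1 + rnorm(300, sd = 0.1)
fit = randomForest(y ~ ., data = toy, mtry = 2, ntree = 300)

oobpred = predict(fit)     # no newdata -> OOB predictions
mean((toy$y - oobpred)^2)  # equals the printed "Mean of squared residuals"
```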

We compute the predictions and the test MSE:

```
predbag = predict(bagmodel, MSFTte)
MSE_bag = mean((MSFTte$target - predbag)^2)
MSE_bag
```

`## [1] 0.03377854`

Note that the MSE is slightly higher than the one obtained for the single tree.

We now analyze graphically the importance of the regressors (this output is obtained by setting `importance = TRUE`):

`varImpPlot(bagmodel)`

The left plot reports the % increase of MSE (computed on the OOB samples, averaged across all \(B\) trees) when the values of a given variable are randomly permuted. This measures how much predictive capability we lose when the information in each variable is removed. The most important regressor, considering all the \(B\) trees, is `MA2`.
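The numbers behind `varImpPlot` can also be extracted as a matrix with the `importance` function. A sketch on toy data, where `x1` is the only informative regressor:

```r
library(randomForest)

# Toy data: only x1 drives the response
set.seed(1)
toy = data.frame(x1 = runif(200), x2 = runif(200), x3 = runif(200))
toy$y = 3 * toy$x1 + rnorm(200, sd = 0.1)
fit = randomForest(y ~ ., data = toy, mtry = 3, ntree = 200, importance = TRUE)

imp = importance(fit)  # columns: %IncMSE and IncNodePurity
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]  # x1 ranks first
```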

## 6.3 Random forest

Random forest is very similar to bagging. The only difference is in the specification of the `mtry` argument: for a regression problem, the number of regressors selected randomly at each split is usually set to \(p/3\).

```
set.seed(11) # the procedure is random (bootstrap)
rfmodel = randomForest(target ~ .,
                       data = MSFTtr,
                       mtry = (ncol(MSFTtr) - 1) / 3, # p/3
                       ntree = 1000,
                       importance = TRUE)
rfmodel
```

```
##
## Call:
## randomForest(formula = target ~ ., data = MSFTtr, mtry = (ncol(MSFTtr) - 1)/3, ntree = 1000, importance = TRUE)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.0001347537
## % Var explained: 99.82
```

As before, we plot the importance measures of the regressors:

`varImpPlot(rfmodel)`

Again `MA2` is the most important regressor.

Finally, we compute the predictions and the test MSE by using the random forest model:

```
predrf = predict(rfmodel, MSFTte)
MSE_rf = mean((MSFTte$target - predrf)^2)
MSE_rf
```

`## [1] 0.0435875`

It turns out that random forest performs worse than bagging here. We finally plot all the predictions:

```
# plot all the predicted time series
data.frame(MSFTte, predsingletree, predbag, predrf) %>%
  ggplot() +
  geom_line(aes(Date, target, col = "obs")) +
  geom_line(aes(Date, predsingletree, col = "pred1tree")) +
  geom_line(aes(Date, predbag, col = "predbag")) +
  geom_line(aes(Date, predrf, col = "predrf"))
```

## 6.4 KNN regression

We finally implement KNN regression using the `FNN` library, as described in Section 2:

```
knnmod = knn.reg(train = dplyr::select(MSFTtr, -target),
                 test = dplyr::select(MSFTte, -target, -Date),
                 y = dplyr::select(MSFTtr, target),
                 k = 20) # this is the tuned value of k
# no need to call predict: the predictions are available in knnmod$pred
MSE_knn = mean((MSFTte$target - knnmod$pred)^2)
MSE_knn
```

`## [1] 0.04544744`

## 6.5 Final prediction

We compare the MSEs:

`MSE1tree`

`## [1] 0.03220142`

`MSE_bag`

`## [1] 0.03377854`

`MSE_rf`

`## [1] 0.0435875`

`MSE_knn`

`## [1] 0.04544744`

The method with the lowest MSE is the single regression tree.
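The comparison is easier to read if the four test MSEs (the values reported above) are collected in a single data frame sorted by performance:

```r
# collect the four test MSEs reported above and sort by performance
results = data.frame(model = c("single tree", "bagging", "random forest", "KNN"),
                     MSE = c(0.03220142, 0.03377854, 0.04358750, 0.04544744))
results[order(results$MSE), ]  # the single tree comes first
```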

We plot all the predictions together:

```
data.frame(MSFTte, predsingletree, predbag, predrf, predknn = knnmod$pred) %>%
  ggplot() +
  geom_line(aes(Date, target, col = "obs")) +
  geom_line(aes(Date, predsingletree, col = "pred1tree")) +
  geom_line(aes(Date, predbag, col = "predbag")) +
  geom_line(aes(Date, predrf, col = "predrf")) +
  geom_line(aes(Date, predknn, col = "predknn"))
```