Chapter 8 R Lab 6 - 04/05/2022
In this lecture we will learn how to prepare the data for implementing a machine learning algorithm when we have a time series, i.e. a set of data indexed by time.
The following packages are required:
library(tidyverse)
library(lubridate) #for dealing with dates
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(zoo) #for time series manipulating and modeling
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
8.1 Very simple example
With the following code we create a very simple data frame containing some fake data. In particular, we will have a column for the dates and a column for the y
data. Note that the function ymd
(year, month, day) from the lubridate
package is used to transform dates stored as characters into “real” dates that can be used later for plotting.
df = data.frame(Date = seq(ymd("2022-04-30"),
                           ymd("2022-05-06"),
                           by = "day"), #regular sequence of daily dates
                y = c(1, 3, 5, 10, 3, 45, 99))
df
## Date y
## 1 2022-04-30 1
## 2 2022-05-01 3
## 3 2022-05-02 5
## 4 2022-05-03 10
## 5 2022-05-04 3
## 6 2022-05-05 45
## 7 2022-05-06 99
df2 = df #this is a copy of df that will be used later
8.1.1 Lead and lag
We check now how the lead and lag functions from the dplyr package work. They are used to find the “previous” (with lag) or “next” (with lead) values in a vector. These values are useful for comparing values behind or ahead of the current values. The first input is the data vector; the second input (1 in this case) defines the number of positions to lead or lag by.
df$lag1y = dplyr::lag(df$y, 1) #previous value
df$lead1y = dplyr::lead(df$y, 1) #next value
df
## Date y lag1y lead1y
## 1 2022-04-30 1 NA 3
## 2 2022-05-01 3 1 5
## 3 2022-05-02 5 3 10
## 4 2022-05-03 10 5 3
## 5 2022-05-04 3 10 45
## 6 2022-05-05 45 3 99
## 7 2022-05-06 99 45 NA
The new column lag1y contains the lagged values of y. For example, on 2022-05-01 we have the value 1, which is the previous value of y. The new column lead1y instead contains the values of y shifted forward: on 2022-05-01 we have the value 5, which is the next value of y.
It is also possible to use a larger offset; for example, to obtain the second past value of y:
df$lag2y = dplyr::lag(df$y, 2)
df
## Date y lag1y lead1y lag2y
## 1 2022-04-30 1 NA 3 NA
## 2 2022-05-01 3 1 5 NA
## 3 2022-05-02 5 3 10 1
## 4 2022-05-03 10 5 3 3
## 5 2022-05-04 3 10 45 5
## 6 2022-05-05 45 3 99 10
## 7 2022-05-06 99 45 NA 3
8.1.2 Moving average or rolling mean
A moving average (or rolling mean or running mean) is a mean computed on a given number of values that move in time. Rolling consists in creating a window with a specified width (2 or 3 values in the following) and computing the average of the data in this window, which rolls through the time series.
For example, let’s consider averages computed using two values and apply the rollmean
function from the zoo
package:
df2$MA2y = zoo::rollmean(df2$y, k = 2, fill = NA, align = "right")
df2
## Date y MA2y
## 1 2022-04-30 1 NA
## 2 2022-05-01 3 2.0
## 3 2022-05-02 5 4.0
## 4 2022-05-03 10 7.5
## 5 2022-05-04 3 6.5
## 6 2022-05-05 45 24.0
## 7 2022-05-06 99 72.0
The second value (2) in the new column MA2y is the mean of the value observed on the same day (2022-05-01) and the value observed on the previous day (2022-04-30). Similarly, 72 is the average of 99 and 45. Note that it is not possible to compute the two-term rolling mean for 2022-04-30 because there is no value from the previous day. It is also possible to compute a rolling mean with 3 terms (change the value of k in the function):
df2$MA3y = zoo::rollmean(df2$y, k = 3, fill = NA, align = "right")
df2
## Date y MA2y MA3y
## 1 2022-04-30 1 NA NA
## 2 2022-05-01 3 2.0 NA
## 3 2022-05-02 5 4.0 3.00000
## 4 2022-05-03 10 7.5 6.00000
## 5 2022-05-04 3 6.5 6.00000
## 6 2022-05-05 45 24.0 19.33333
## 7 2022-05-06 99 72.0 49.00000
After 2 missing values, the first value is 3, which is given by the average of 1, 3 and 5.
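As a quick check (this verification is not in the original output), the value can be reproduced directly:
mean(c(1, 3, 5)) #average of the first three values of y
## [1] 3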
We now want to lag the new columns MA2y and MA3y, so that for a specific day we have the rolling mean computed only on past data (the current day's value is not included in the computation):
df2$lag1MA2y = dplyr::lag(df2$MA2y, 1)
df2$lag1MA3y = dplyr::lag(df2$MA3y, 1)
df2
## Date y MA2y MA3y lag1MA2y lag1MA3y
## 1 2022-04-30 1 NA NA NA NA
## 2 2022-05-01 3 2.0 NA NA NA
## 3 2022-05-02 5 4.0 3.00000 2.0 NA
## 4 2022-05-03 10 7.5 6.00000 4.0 3.00000
## 5 2022-05-04 3 6.5 6.00000 7.5 6.00000
## 6 2022-05-05 45 24.0 19.33333 6.5 6.00000
## 7 2022-05-06 99 72.0 49.00000 24.0 19.33333
In this case the value of lag1MA2y that we have for 2022-05-02 is the mean of the past two values (1 and 3). Similarly, the value of lag1MA3y that we have for 2022-05-03 is the mean of the past three values (1, 3 and 5).
8.1.3 Rolling standard deviation
Following an approach similar to the one used for the rolling mean, it is possible to compute the rolling standard deviation in order to measure the variability of a given number of past values (3 in the following case). The following code, based on the rollapply and lag functions, is used:
df2$SD3 = zoo::rollapply(data = df$y,
                         width = 3,
                         align = "right",
                         FUN = "sd",
                         fill = NA)
df2
## Date y MA2y MA3y lag1MA2y lag1MA3y SD3
## 1 2022-04-30 1 NA NA NA NA NA
## 2 2022-05-01 3 2.0 NA NA NA NA
## 3 2022-05-02 5 4.0 3.00000 2.0 NA 2.000000
## 4 2022-05-03 10 7.5 6.00000 4.0 3.00000 3.605551
## 5 2022-05-04 3 6.5 6.00000 7.5 6.00000 3.605551
## 6 2022-05-05 45 24.0 19.33333 6.5 6.00000 22.501852
## 7 2022-05-06 99 72.0 49.00000 24.0 19.33333 48.124838
df2$lag1SD3 = dplyr::lag(df2$SD3, 1)
df2
## Date y MA2y MA3y lag1MA2y lag1MA3y SD3 lag1SD3
## 1 2022-04-30 1 NA NA NA NA NA NA
## 2 2022-05-01 3 2.0 NA NA NA NA NA
## 3 2022-05-02 5 4.0 3.00000 2.0 NA 2.000000 NA
## 4 2022-05-03 10 7.5 6.00000 4.0 3.00000 3.605551 2.000000
## 5 2022-05-04 3 6.5 6.00000 7.5 6.00000 3.605551 3.605551
## 6 2022-05-05 45 24.0 19.33333 6.5 6.00000 22.501852 3.605551
## 7 2022-05-06 99 72.0 49.00000 24.0 19.33333 48.124838 22.501852
Note that the first available value (2), for 2022-05-03, is the standard deviation of the past 3 values:
sd(c(1,3,5))
## [1] 2
8.1.4 Definition of the target variable
Remember that we want to predict the next-day value of y using all the past information available at each day. In order to align the data correctly for this supervised problem, we need to shift the y column using the lead function, so that for each day we have features calculated from past days while the target is the value of y from the upcoming day.
df2$target = dplyr::lead(df2$y, 1)
df2
## Date y MA2y MA3y lag1MA2y lag1MA3y SD3 lag1SD3 target
## 1 2022-04-30 1 NA NA NA NA NA NA 3
## 2 2022-05-01 3 2.0 NA NA NA NA NA 5
## 3 2022-05-02 5 4.0 3.00000 2.0 NA 2.000000 NA 10
## 4 2022-05-03 10 7.5 6.00000 4.0 3.00000 3.605551 2.000000 3
## 5 2022-05-04 3 6.5 6.00000 7.5 6.00000 3.605551 3.605551 45
## 6 2022-05-05 45 24.0 19.33333 6.5 6.00000 22.501852 3.605551 99
## 7 2022-05-06 99 72.0 49.00000 24.0 19.33333 48.124838 22.501852 NA
For example, the target value 3 that is available for 2022-04-30 is the value of y occurring on the next day, which is what we want to predict.
8.1.5 Removing the missing values and the undesired columns
We conclude this simple example by removing all the rows containing at least one NA and by keeping only the variables that should be used for the model estimation (target variable + regressors):
df3 = df2 %>%
  na.omit() %>%
  dplyr::select(-y, -MA2y, -MA3y, -SD3)
df3
## Date lag1MA2y lag1MA3y lag1SD3 target
## 4 2022-05-03 4.0 3 2.000000 3
## 5 2022-05-04 7.5 6 3.605551 45
## 6 2022-05-05 6.5 6 3.605551 99
8.2 Prediction of the Microsoft adjusted close prices
We consider now the file named MSFT_data.csv
containing the price time series for the Microsoft stock (source: Yahoo Finance). For the analysis the following packages will be necessary:
library(rpart) #regression tree
library(rpart.plot) #tree plots
library(randomForest) #bagging and random forest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
Moreover, we define a new function that will be used later for normalizing the data. Given a vector of numerical values, normalization consists in computing for each value the following quantity: \[ \frac{x-\min(x)}{\max(x)-\min(x)} \] The resulting new vector will contain values included between 0 and 1. Normalization is basically a linear transformation of the values and is useful when working with variables with completely different ranges (after the normalization the range will be [0,1] for all the variables). The function definition follows:
normalize = function(x){
  (x - min(x))/(max(x) - min(x))
}
In order to try the new function named normalize
we simulate a vector of values from the Normal(10,4) distribution and then normalize them:
x = rnorm(10, mean = 10, sd = 4)
x
## [1] 5.338675 8.104326 8.927910 14.008128 10.514855 3.750545 2.315872
## [8] 4.983321 4.929259 8.064524
range(x)
## [1] 2.315872 14.008128
normalize(x)
## [1] 0.2585303 0.4950673 0.5655058 1.0000000 0.7012319 0.1227028 0.0000000
## [8] 0.2281381 0.2235143 0.4916632
range(normalize(x)) #now it's [0,1]
## [1] 0 1
8.2.1 Data import and manipulation
We import the data:
MSFT <- read.csv("./files/MSFT_data.csv")
head(MSFT)
## Date Adj.Close Close High Low Open Volume
## 1 2017-04-26 63.31828 67.83 68.31 67.62 68.08 26190800
## 2 2017-04-27 63.72900 68.27 68.38 67.58 68.15 34971000
## 3 2017-04-28 63.90636 68.46 69.14 67.69 68.91 39548800
## 4 2017-05-01 64.79317 69.41 69.55 68.50 68.68 31954400
## 5 2017-05-02 64.69049 69.30 69.71 69.13 69.71 23906100
## 6 2017-05-03 64.48512 69.08 69.38 68.71 69.38 28928000
glimpse(MSFT)
## Rows: 1,259
## Columns: 7
## $ Date <fct> 2017-04-26, 2017-04-27, 2017-04-28, 2017-05-01, 2017-05-02, …
## $ Adj.Close <dbl> 63.31828, 63.72900, 63.90636, 64.79317, 64.69049, 64.48512, …
## $ Close <dbl> 67.83, 68.27, 68.46, 69.41, 69.30, 69.08, 68.81, 69.00, 68.9…
## $ High <dbl> 68.31, 68.38, 69.14, 69.55, 69.71, 69.38, 69.08, 69.03, 69.0…
## $ Low <dbl> 67.62, 67.58, 67.69, 68.50, 69.13, 68.71, 68.64, 68.49, 68.4…
## $ Open <dbl> 68.08, 68.15, 68.91, 68.68, 69.71, 69.38, 69.03, 68.90, 68.9…
## $ Volume <int> 26190800, 34971000, 39548800, 31954400, 23906100, 28928000, …
The following variables are available for 1259 days: the date, the adjusted closing price, the closing and opening price, the highest and lowest price during the day and the exchange volume.
Note that the first column contains the dates, which are stored as text (without any temporal meaning). It is thus important to transform these strings into “real” dates with a temporal ordering. To do this the function ymd from the lubridate package is used (the old Date column is overwritten):
MSFT$Date = ymd(MSFT$Date)
range(MSFT$Date)
## [1] "2017-04-26" "2022-04-25"
The variable we are interested in is the adjusted close price represented in the following plot:
MSFT %>%
  ggplot() +
  geom_line(aes(Date, Adj.Close))
It is a non-stationary time series, with an increasing trend in time and periods of extremely high variability.
8.2.2 Preparation of the regressors and of the response variable
For the supervised problem we will use the following regressors:
- the 2/7/14/21-day moving averages (see Section 8.1.2) computed using past Adj.Close values
- the 7-day rolling standard deviation computed using past Adj.Close values
- the previous day's Volume
The following code is used for creating the new regressor columns in MSFT:
MSFT$MA2 = zoo::rollmean(MSFT$Adj.Close,
                         k = 2, fill = NA, align = "right") %>%
  dplyr::lag(1)
MSFT$MA7 = zoo::rollmean(MSFT$Adj.Close,
                         k = 7, fill = NA, align = "right") %>%
  dplyr::lag(1)
MSFT$MA14 = zoo::rollmean(MSFT$Adj.Close,
                          k = 14, fill = NA, align = "right") %>%
  dplyr::lag(1)
MSFT$MA21 = zoo::rollmean(MSFT$Adj.Close,
                          k = 21, fill = NA, align = "right") %>%
  dplyr::lag(1)

MSFT$SD7 = rollapply(data = MSFT$Adj.Close,
                     width = 7,
                     align = "right",
                     FUN = sd,
                     fill = NA) %>%
  dplyr::lag(1)

MSFT$lag1volume = dplyr::lag(MSFT$Volume, 1)

head(MSFT, 25) #first 25 rows of the df
## Date Adj.Close Close High Low Open Volume MA2 MA7
## 1 2017-04-26 63.31828 67.83 68.31 67.62 68.08 26190800 NA NA
## 2 2017-04-27 63.72900 68.27 68.38 67.58 68.15 34971000 NA NA
## 3 2017-04-28 63.90636 68.46 69.14 67.69 68.91 39548800 63.52364 NA
## 4 2017-05-01 64.79317 69.41 69.55 68.50 68.68 31954400 63.81768 NA
## 5 2017-05-02 64.69049 69.30 69.71 69.13 69.71 23906100 64.34976 NA
## 6 2017-05-03 64.48512 69.08 69.38 68.71 69.38 28928000 64.74183 NA
## 7 2017-05-04 64.23309 68.81 69.08 68.64 69.03 21749400 64.58780 NA
## 8 2017-05-05 64.41046 69.00 69.03 68.49 68.90 19128800 64.35910 64.16507
## 9 2017-05-08 64.35445 68.94 69.05 68.42 68.97 18566100 64.32178 64.32110
## 10 2017-05-09 64.44778 69.04 69.28 68.68 68.86 22858400 64.38245 64.41045
## 11 2017-05-10 64.69983 69.31 69.56 68.92 68.99 17977800 64.40112 64.48779
## 12 2017-05-11 63.90636 68.46 68.73 68.12 68.36 28789400 64.57381 64.47446
## 13 2017-05-12 63.83168 68.38 68.61 68.04 68.61 18714100 64.30309 64.36244
## 14 2017-05-15 63.87835 68.43 68.48 67.57 68.14 31530300 63.86902 64.26909
## 15 2017-05-16 65.16456 69.41 69.44 68.16 68.23 34956000 63.85502 64.21842
## 16 2017-05-17 63.35262 67.48 69.10 67.43 68.89 30548800 64.52146 64.32614
## 17 2017-05-18 63.56855 67.71 68.13 67.14 67.40 25201300 64.25859 64.18303
## 18 2017-05-19 63.54977 67.69 68.10 67.43 67.50 26961100 63.46059 64.05742
## 19 2017-05-22 64.26327 68.45 68.50 67.50 67.89 16237600 63.55916 63.89313
## 20 2017-05-23 64.47922 68.68 68.75 68.38 68.72 15425800 63.90652 63.94412
## 21 2017-05-24 64.56371 68.77 68.88 68.45 68.87 14593900 64.37124 64.03662
## 22 2017-05-25 65.36172 69.62 69.88 68.91 68.97 21854100 64.52146 64.13453
## 23 2017-05-26 65.68095 69.96 70.22 69.52 69.80 19827900 64.96272 64.16269
## 24 2017-05-30 66.10342 70.41 70.41 69.77 69.79 17072800 65.52134 64.49531
## 25 2017-05-31 65.56825 69.84 70.74 69.81 70.53 30436400 65.89218 64.85744
## MA14 MA21 SD7 lag1volume
## 1 NA NA NA NA
## 2 NA NA NA 26190800
## 3 NA NA NA 34971000
## 4 NA NA NA 39548800
## 5 NA NA NA 31954400
## 6 NA NA NA 23906100
## 7 NA NA NA 28928000
## 8 NA NA 0.5403347 21749400
## 9 NA NA 0.3925384 19128800
## 10 NA NA 0.2941575 18566100
## 11 NA NA 0.1934688 22858400
## 12 NA NA 0.1708033 17977800
## 13 NA NA 0.2460612 28789400
## 14 NA NA 0.3079325 18714100
## 15 64.19174 NA 0.3421353 31530300
## 16 64.32362 NA 0.4965557 34956000
## 17 64.29674 NA 0.6168428 30548800
## 18 64.27261 NA 0.6429121 25201300
## 19 64.18379 NA 0.5966703 26961100
## 20 64.15328 NA 0.6130147 16237600
## 21 64.15286 NA 0.6414197 15425800
## 22 64.17647 64.17267 0.6651045 14593900
## 23 64.24442 64.26998 0.7180622 21854100
## 24 64.33917 64.36293 0.8132354 19827900
## 25 64.45743 64.46755 0.8923094 17072800
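As a quick sanity check (not part of the original output), the first non-missing MA2 value, reported for 2017-04-28, is the average of the adjusted close prices of the two previous days:
mean(MSFT$Adj.Close[1:2]) #average of the 2017-04-26 and 2017-04-27 prices
## [1] 63.52364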
The following code returns the plot containing the original time series together with two rolling mean series and the rolling sd time series:
# Plot adjusted close prices + MA + SD
MSFT %>%
  ggplot() +
  geom_line(aes(Date, Adj.Close)) +
  geom_line(aes(Date, MA2, col = "MA2")) +
  geom_line(aes(Date, MA21, col = "MA21"), lwd = 1.2) +
  geom_line(aes(Date, SD7, col = "SD7"))
## Warning: Removed 2 row(s) containing missing values (geom_path).
## Warning: Removed 21 row(s) containing missing values (geom_path).
## Warning: Removed 7 row(s) containing missing values (geom_path).
The option col="..." is not used here for setting a specific color (the colors are chosen automatically by R) but only for easily creating a legend with the line names (otherwise we would have to reshape the data and write more code). The MA2 time series is very similar to the original price series, while the MA21 series is smoother (it catches the general trend but not the fine details). The SD7 series has a different range of values and shows peaks during periods of high volatility.
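For completeness, here is a minimal sketch of the reshaping approach mentioned above (it is not part of the original lab code): by pivoting the selected columns to long format with tidyr, ggplot assigns the colors and builds the legend automatically.
# Sketch only: reshape to long format so that one aes(col = ...) creates the legend
MSFT %>%
  dplyr::select(Date, Adj.Close, MA2, MA21, SD7) %>%
  tidyr::pivot_longer(-Date, names_to = "series", values_to = "value") %>%
  ggplot() +
  geom_line(aes(Date, value, col = series))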
We finally create a new column for the response variable, represented by the next-day price (see the previous simple example):
MSFT$target = lead(MSFT$Adj.Close, 1)
head(MSFT)
## Date Adj.Close Close High Low Open Volume MA2 MA7 MA14 MA21
## 1 2017-04-26 63.31828 67.83 68.31 67.62 68.08 26190800 NA NA NA NA
## 2 2017-04-27 63.72900 68.27 68.38 67.58 68.15 34971000 NA NA NA NA
## 3 2017-04-28 63.90636 68.46 69.14 67.69 68.91 39548800 63.52364 NA NA NA
## 4 2017-05-01 64.79317 69.41 69.55 68.50 68.68 31954400 63.81768 NA NA NA
## 5 2017-05-02 64.69049 69.30 69.71 69.13 69.71 23906100 64.34976 NA NA NA
## 6 2017-05-03 64.48512 69.08 69.38 68.71 69.38 28928000 64.74183 NA NA NA
## SD7 lag1volume target
## 1 NA NA 63.72900
## 2 NA 26190800 63.90636
## 3 NA 34971000 64.79317
## 4 NA 39548800 64.69049
## 5 NA 31954400 64.48512
## 6 NA 23906100 64.23309
We then remove all the rows containing missing values and select only the useful columns:
MSFT = MSFT %>%
  na.omit() %>%
  dplyr::select(Date, MA2, MA7, MA14, MA21, SD7, lag1volume, target)

head(MSFT)
## Date MA2 MA7 MA14 MA21 SD7 lag1volume target
## 22 2017-05-25 64.52146 64.13453 64.17647 64.17267 0.6651045 14593900 65.68095
## 23 2017-05-26 64.96272 64.16269 64.24442 64.26998 0.7180622 21854100 66.10342
## 24 2017-05-30 65.52134 64.49531 64.33917 64.36293 0.8132354 19827900 65.56825
## 25 2017-05-31 65.89218 64.85744 64.45743 64.46755 0.8923094 17072800 65.81235
## 26 2017-06-01 65.83583 65.14579 64.51946 64.50446 0.7059899 30436400 67.37083
## 27 2017-06-02 65.69030 65.36709 64.65560 64.55788 0.6209101 21603600 67.85901
The following code represents graphically the correlation between the considered variables:
library(ggcorrplot)
ggcorrplot(cor(MSFT[,-1]),
           hc.order = TRUE,
           type = "lower",
           lab = TRUE, digits = 4)
8.2.3 Training and test set
When we work with time series we want to create the training and test sets without breaking the temporal sequence of the data (and the related temporal correlation). For this reason we don't select the training data randomly (as done in Section 6.1.1), but we select the observations from 2017 to the end of 2021 as training data and the remaining data for the test (i.e. for evaluating the predictive performance of the model). The row selection is done using the information about the year, which can be extracted from the full date using the year function from lubridate:
MSFTtr = MSFT %>%
  filter(year(Date) < 2022)
MSFTte = MSFT %>%
  filter(year(Date) >= 2022)
With this approach the percentage of data dedicated to the training is the following:
nrow(MSFTtr)/nrow(MSFT)*100
## [1] 93.77526
The number of test observations is equal to 77.
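This number can be checked directly:
nrow(MSFTte)
## [1] 77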
8.2.4 Data normalization
By using the function normalize we now normalize all the columns of the training and test data frames except Date. To apply the normalization column by column (i.e. separately for each variable) we use the function apply (see ?apply):
MSFTtr = data.frame(apply(MSFTtr[,-1], 2, normalize))
head(MSFTtr)
## MA2 MA7 MA14 MA21 SD7 lag1volume
## 1 0.0005927972 0.0000000000 0.0000000000 0.0000000000 0.05296239 0.06904779
## 2 0.0021850423 0.0001024794 0.0002496531 0.0003594390 0.05802160 0.13898080
## 3 0.0042008004 0.0013126507 0.0005977847 0.0007027827 0.06711377 0.11946367
## 4 0.0055389817 0.0026301730 0.0010322950 0.0010892422 0.07466795 0.09292550
## 5 0.0053356564 0.0036793010 0.0012602073 0.0012255786 0.05686829 0.22164877
## 6 0.0048105012 0.0044844521 0.0017604226 0.0014229115 0.04874037 0.13656789
## target
## 1 0.006036340
## 2 0.007553825
## 3 0.005631552
## 4 0.006508324
## 5 0.012106306
## 6 0.013859796
MSFTte = data.frame(apply(MSFTte[,-1], 2, normalize))
head(MSFTte)
## MA2 MA7 MA14 MA21 SD7 lag1volume target
## 1 1.0000000 0.9960674 1.0000000 0.9876331 0.09785490 0.0000000 1.0000000
## 2 0.9618317 1.0000000 0.9925227 0.9934102 0.06212817 0.1500012 0.7678880
## 3 0.9007791 0.9855889 0.9936001 1.0000000 0.21994156 0.2025940 0.7219434
## 4 0.7473543 0.9194452 0.9642216 0.9892257 0.66243705 0.3044882 0.7248838
## 5 0.6209897 0.8500031 0.9465012 0.9661175 0.88719009 0.2988522 0.7291106
## 6 0.6014462 0.7791910 0.9308069 0.9431301 0.89798474 0.2032250 0.7421593
The second input of the apply function is the MARGIN option: it is equal to 2 if you want to compute the required function (normalize) column by column, and to 1 if you want to do the computation row by row. We wrap the result in data.frame in order to be sure that the final object is a data frame (and not a matrix).
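As a small illustration of the MARGIN option (a toy example, not based on the lab data), the same function applied with MARGIN = 1 works row by row and with MARGIN = 2 column by column:
m = matrix(1:6, nrow = 2) #2 rows, 3 columns
apply(m, 1, sum) #MARGIN = 1: row sums (9 and 12)
apply(m, 2, sum) #MARGIN = 2: column sums (3, 7 and 11)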
8.2.5 Regression tree
We now have the data to estimate a regression tree for predicting the MSFT price, given the 6 selected regressors.
We use the training data to estimate the tree by means of the rpart function of the rpart package (see ?rpart). As usual we have to specify the formula, the dataset and the method (which is set to anova for a regression tree). More information about the anova methodology can be found in this document (page 32).
pricetree = rpart(formula = target ~ .,
                  data = MSFTtr,
                  method = "anova")
Note that some options that control the rpart
algorithm are specified by means of the rpart.control
function (see ?rpart.control
). In particular, the cp
option, set by default equal to 0.01, represents a pre-pruning step because it prevents the generation of non-significant branches (any split that does not decrease the overall lack of fit by a factor of cp
is not attempted). See also the meaning of the minsplit
, minbucket
and maxdepth
options.
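For instance, a minimal sketch of how these options could be passed (the parameter values below are purely illustrative and the object name treecontrolled is hypothetical):
# Sketch only: a tree grown with non-default control settings
treecontrolled = rpart(formula = target ~ .,
                       data = MSFTtr,
                       method = "anova",
                       control = rpart.control(cp = 0.001,    #smaller cp, larger tree
                                               minsplit = 20, #min. obs. to attempt a split
                                               maxdepth = 10))#max. depth of any node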
It is possible to plot the estimated tree as follows:
rpart.plot(pricetree)
The function rpart.plot has different options controlling which information is displayed. Here we use it with its default settings: each node reports the average price and the percentage of training observations it contains (with respect to the total number of training observations). Note that basically all the numbers in the plot are rounded.
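For example (a sketch only; the option values below are illustrative), we could ask for the number and percentage of observations in each node and for more digits:
rpart.plot(pricetree, extra = 101, digits = 3) #extra = 101: n and % of obs. in each node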
It is also possible to analyze the structure of the tree by printing it:
pricetree
## n= 1160
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1160 84.7423800 0.34201040
## 2) MA2< 0.4017356 729 7.8698120 0.16016720
## 4) MA14< 0.1809473 465 1.2662160 0.09315739
## 8) MA2< 0.0809836 183 0.1166070 0.03702299 *
## 9) MA2>=0.0809836 282 0.1987565 0.12958500 *
## 5) MA14>=0.1809473 264 0.8378773 0.27819590 *
## 3) MA2>=0.4017356 431 11.9938400 0.64958280
## 6) MA7< 0.7050037 296 1.9696930 0.55132710
## 12) MA21< 0.5648444 192 0.4525512 0.50105920 *
## 13) MA21>=0.5648444 104 0.1363078 0.64412940 *
## 7) MA7>=0.7050037 135 0.9008853 0.86501730 *
The most important regressor, which is located in the root node, is MA2 (so basically information referring to the very recent past).
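A more formal ranking of the regressors can be obtained from the variable.importance component of the fitted rpart object (an additional check, output not shown here):
pricetree$variable.importance #importance scores of the regressors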
Given the grown tree we are now able to compute the predictions for the test set and to compute, as a measure of performance, the mean squared error (\(MSE = \frac{1}{n}\sum_{i=1}^n(y_i-\hat y_i)^2\), with \(i\) denoting the generic observation in the test set and \(n\) the number of test observations). The predictions can be obtained by using the predict function, which will return the vector of predicted prices:
pricepred = predict(object = pricetree,
                    newdata = MSFTte)
head(pricepred)
## 1 2 3 4 5 6
## 0.8650173 0.8650173 0.8650173 0.8650173 0.8650173 0.8650173
# MSE
mean((MSFTte$target - pricepred)^2)
## [1] 0.03220142
We also plot the observed test data together with the corresponding predictions:
# Define first a data.frame before using ggplot
data.frame(obs = MSFTte$target, pred = pricepred) %>%
  ggplot() +
  geom_line(aes(1:nrow(MSFTte), obs, col = "obs")) +
  geom_line(aes(1:nrow(MSFTte), pred, col = "pred"))
We see that the regression tree is able to capture the trend of the test set but it is not doing a good job in the prediction (with many periods of constant predictions). We will try to do better by using an ensemble method, like bagging, random forest or boosting.
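As a preview of that direction (a minimal sketch, not part of this lab's estimation; the object names rfprice and rfpred are illustrative), a random forest could be fitted on the same training data with the randomForest package loaded above and evaluated with the same MSE:
# Sketch only: random forest on the normalized training data
set.seed(1) #for reproducibility of the bootstrap samples
rfprice = randomForest(target ~ ., data = MSFTtr,
                       ntree = 500, #number of trees (the default)
                       importance = TRUE)
rfpred = predict(rfprice, newdata = MSFTte)
mean((MSFTte$target - rfpred)^2) #test MSE, to be compared with the tree's MSE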
8.3 Exercises Lab 6
8.3.1 Exercise 1
Consider the data in the file Adjcloseprices.csv
: it contains the adjusted close prices for Apple (AAPL) and the values of the NASDAQ index (IXIC). There is also a column containing dates.
- Import the data and check the length of the time series.
- Plot the time series for AAPL and the NASDAQ index (to do this you first have to set the dates properly).
- Knowing that the AAPL price is the target variable, compute the following regressors and set up the data frame properly:
  - moving averages of the past 2/7/14 and 21 AAPL prices
  - rolling standard deviations of the past 7 and 21 AAPL prices
  - the past two values of the NASDAQ index.
  Include also the column for the prices to be predicted.
- Consider the first 70% of the data for training and the remaining data for testing. Prepare the two data frames, removing the missing values, and then normalize the data.
- Implement a regression tree for predicting the AAPL price using the regressors defined above. Plot the estimated regression tree.
- Given the tree computed above, compute the MSE and plot the predicted and observed AAPL values in the same chart.