Chapter 5 R Lab 4 - 05/05/2023
In this lecture we will learn how to prepare the data for implementing a machine learning algorithm when we have a time series, i.e. a set of data indexed by time.
The following packages are required:
library(tidyverse)
library(lubridate) #for dealing with dates
library(zoo) #for time series manipulating and modeling
5.1 Very simple example
With the following code we create a very simple data frame containing some fake data. In particular, we will have a column for the dates and a column for the y
data. Note that the function ymd
(year, month, day) from the lubridate
package is used to transform dates stored in character into “real” dates that can be used later for plotting.
= data.frame(Date = seq(ymd("2022-04-30"),
df ymd("2022-05-06"),
by="day"), #regular sequence of daily dates
y = c(1, 3, 5, 10, 3, 45, 99))
df
## Date y
## 1 2022-04-30 1
## 2 2022-05-01 3
## 3 2022-05-02 5
## 4 2022-05-03 10
## 5 2022-05-04 3
## 6 2022-05-05 45
## 7 2022-05-06 99
= df #this is a copy of df that will be used later df2
5.1.1 Lead and lag
We check now how the lead
and lag
function from the dplyr
package works. They are used to find the “previous” (with lag
) or “next” (with lead
) values in a vector. These values are useful for comparing values behind of or ahead of the current values. The first input is the data vector, the second input (1
in this case) defines the number of positions to lead or lag by.
$lag1y = dplyr::lag(df$y,1) #previous value
df$lead1y = dplyr::lead(df$y,1) #next value
df df
## Date y lag1y lead1y
## 1 2022-04-30 1 NA 3
## 2 2022-05-01 3 1 5
## 3 2022-05-02 5 3 10
## 4 2022-05-03 10 5 3
## 5 2022-05-04 3 10 45
## 6 2022-05-05 45 3 99
## 7 2022-05-06 99 45 NA
The new column lag1y
contains the lagged values of y
. For example, on 2022-05-01 we have the value (1) referring to the previous past value of y
. Instead the new column lead1y
contains the shifted values of y
when looking forward. For example, on 2022-05-01 we have the value (5) referring to the next value of y
.
It is also possible to use a larger offset, for example I want the second past value of y
:
$lag2y = dplyr::lag(df$y,2)
df df
## Date y lag1y lead1y lag2y
## 1 2022-04-30 1 NA 3 NA
## 2 2022-05-01 3 1 5 NA
## 3 2022-05-02 5 3 10 1
## 4 2022-05-03 10 5 3 3
## 5 2022-05-04 3 10 45 5
## 6 2022-05-05 45 3 99 10
## 7 2022-05-06 99 45 NA 3
Obviously in this case the first value will be available for the third day (02/05/2022).
5.1.2 Moving average or rolling mean
A moving average (or rolling mean or running mean) is a mean computed on a given number of values that move in time. Rolling consists in creating a rolling window with a specified width (2 or 3 values in the following) and in computing the average on the data in this window which, of course, rolls through the time series.
For example, let’s consider averages computed using two values and apply the rollmean
function from the zoo
package:
$MA2y = zoo::rollmean(df2$y, k = 2, fill = NA, align = "right")
df2 df2
## Date y MA2y
## 1 2022-04-30 1 NA
## 2 2022-05-01 3 2.0
## 3 2022-05-02 5 4.0
## 4 2022-05-03 10 7.5
## 5 2022-05-04 3 6.5
## 6 2022-05-05 45 24.0
## 7 2022-05-06 99 72.0
The second value (2) in the new column MA2y
is the mean of the value observed on the same day (2022-05-01) and the value observed in the previous day (2022-04-30). Similarly, 72 is the average between 99 and 45. Note that it is not possible to compute the two terms rolling mean for 2022-04-30 because there is no value from the previous day (in this case we will get NA
thanks to the option fill=NA
). align="right"
means that the rolling mean is computed using the value on a specific line (which is the one on the right side in the temporal window) and the previous one.
It is also possible to compute a rolling mean with 3 terms (change the value of k
in the function):
$MA3y = zoo::rollmean(df2$y, k = 3, fill = NA, align = "right")
df2 df2
## Date y MA2y MA3y
## 1 2022-04-30 1 NA NA
## 2 2022-05-01 3 2.0 NA
## 3 2022-05-02 5 4.0 3.00000
## 4 2022-05-03 10 7.5 6.00000
## 5 2022-05-04 3 6.5 6.00000
## 6 2022-05-05 45 24.0 19.33333
## 7 2022-05-06 99 72.0 49.00000
After 3 missing values, the first value is 3 which is given by the average of 5, 3 and 1.
We want now to lag the new columns MA2y
and MA3y
, so that for a specific day we have the rolling mean referred only to past data (without including in the computation of the rolling mean the actual value):
$lag1MA2y = dplyr::lag(df2$MA2, 1)
df2$lag1MA3y = dplyr::lag(df2$MA3, 1)
df2 df2
## Date y MA2y MA3y lag1MA2y lag1MA3y
## 1 2022-04-30 1 NA NA NA NA
## 2 2022-05-01 3 2.0 NA NA NA
## 3 2022-05-02 5 4.0 3.00000 2.0 NA
## 4 2022-05-03 10 7.5 6.00000 4.0 3.00000
## 5 2022-05-04 3 6.5 6.00000 7.5 6.00000
## 6 2022-05-05 45 24.0 19.33333 6.5 6.00000
## 7 2022-05-06 99 72.0 49.00000 24.0 19.33333
In this case the value of lag1MA2y
that we have for 2022-05-02 is the mean of the past two values (1 and 3). Similarly, the value of lag2MA2y
that we have for 2022-05-03 is the mean of the past three values (1, 3 and 5).
5.1.3 Rolling standard deviation
Following an approach similar to the one for computing the rolling mean, it is possible to compute the rolling standard deviation in order to measure the variability of a given number of past values (3 in the following case). The following code, adopting the rollapply
and lag
functions, is used:
$SD3 = zoo::rollapply(data = df$y,
df2width = 3,
align ="right",
FUN="sd",
fill=NA)
df2
## Date y MA2y MA3y lag1MA2y lag1MA3y SD3
## 1 2022-04-30 1 NA NA NA NA NA
## 2 2022-05-01 3 2.0 NA NA NA NA
## 3 2022-05-02 5 4.0 3.00000 2.0 NA 2.000000
## 4 2022-05-03 10 7.5 6.00000 4.0 3.00000 3.605551
## 5 2022-05-04 3 6.5 6.00000 7.5 6.00000 3.605551
## 6 2022-05-05 45 24.0 19.33333 6.5 6.00000 22.501852
## 7 2022-05-06 99 72.0 49.00000 24.0 19.33333 48.124838
$lag1SD3 = dplyr::lag(df2$SD3,1)
df2 df2
## Date y MA2y MA3y lag1MA2y lag1MA3y SD3 lag1SD3
## 1 2022-04-30 1 NA NA NA NA NA NA
## 2 2022-05-01 3 2.0 NA NA NA NA NA
## 3 2022-05-02 5 4.0 3.00000 2.0 NA 2.000000 NA
## 4 2022-05-03 10 7.5 6.00000 4.0 3.00000 3.605551 2.000000
## 5 2022-05-04 3 6.5 6.00000 7.5 6.00000 3.605551 3.605551
## 6 2022-05-05 45 24.0 19.33333 6.5 6.00000 22.501852 3.605551
## 7 2022-05-06 99 72.0 49.00000 24.0 19.33333 48.124838 22.501852
Note that the first available value (2) available for 2022-05-03 is the standard deviation of the past 3 values:
sd(c(1,3,5))
## [1] 2
5.1.4 Definition of the target variable
Remember that we want to predict the next day value of y
using all the past information that are available at each day. In order to align correctly the data for this supervised problem, we need to shift the y
column using the lead
function, so that for each day we have features calculated from past days but y
is from upcoming day.
$target = dplyr::lead(df2$y, 1)
df2 df2
## Date y MA2y MA3y lag1MA2y lag1MA3y SD3 lag1SD3 target
## 1 2022-04-30 1 NA NA NA NA NA NA 3
## 2 2022-05-01 3 2.0 NA NA NA NA NA 5
## 3 2022-05-02 5 4.0 3.00000 2.0 NA 2.000000 NA 10
## 4 2022-05-03 10 7.5 6.00000 4.0 3.00000 3.605551 2.000000 3
## 5 2022-05-04 3 6.5 6.00000 7.5 6.00000 3.605551 3.605551 45
## 6 2022-05-05 45 24.0 19.33333 6.5 6.00000 22.501852 3.605551 99
## 7 2022-05-06 99 72.0 49.00000 24.0 19.33333 48.124838 22.501852 NA
For example, the target value 3 that is available for 2022-04-30 is the value of y
occuring in the next day and that we want to predict using the information computed data from past days before 2022-04-30.
5.1.5 Removing the missing values and the undesidered columns
We conclude with simple example by removing all the rows containing at least one NA and by considering only the variables that should be used for the model estimation (target variables + regressors):
= df2 %>%
df3 na.omit() %>%
::select(-y, -MA2y,-MA3y,-SD3)
dplyr df3
## Date lag1MA2y lag1MA3y lag1SD3 target
## 4 2022-05-03 4.0 3 2.000000 3
## 5 2022-05-04 7.5 6 3.605551 45
## 6 2022-05-05 6.5 6 3.605551 99
5.2 Prediction of the MSFT adj prices
We consider now the file named MSFT_data.csv
containing the prices time series for the MICROSOFT stock (source: Yahoo finance). For the analysis the following packages will be necessary:
library(rpart) #regression tree
library(rpart.plot) #tree plots
library(randomForest) #bagging and random forecast
Moreover, we define a new function that will be used later for normalizing the data. Given a vector of numerical values, normalization consists in computing for each value the following quantity: \[ \frac{x-\min(x)}{\max(x)-\min(x)} \] ] The resulting new vector will contain values included between 0 and 1. Normalization is basically a linear transformation of the values and is useful when working with variables with completely different ranges (after the normalization the range will be [0,1] for all the variables). The function definition follows:
= function(x){
normalize -min(x))/(max(x)-min(x))
(x }
In order to try the new function named normalize
we simulate a vector of values from the Normal(10,4) distribution and then normalize them:
= rnorm(10, mean=10, sd=4)
x x
## [1] 13.474163 16.624712 15.782544 12.978402 7.239239 6.834353 8.951675 8.368332
## [9] 10.805244 7.072210
range(x)
## [1] 6.834353 16.624712
normalize(x)
## [1] 0.67819879 1.00000000 0.91397988 0.62756114 0.04135560 0.00000000 0.21626605
## [8] 0.15668257 0.40559200 0.02429504
range(normalize(x)) #now it's [0,1]
## [1] 0 1
5.2.1 Data import and manipulation
We import the data:
<- read.csv("./files/MSFT_data.csv")
MSFT head(MSFT)
## Date Adj.Close Close High Low Open Volume
## 1 2017-04-26 63.31828 67.83 68.31 67.62 68.08 26190800
## 2 2017-04-27 63.72900 68.27 68.38 67.58 68.15 34971000
## 3 2017-04-28 63.90636 68.46 69.14 67.69 68.91 39548800
## 4 2017-05-01 64.79317 69.41 69.55 68.50 68.68 31954400
## 5 2017-05-02 64.69049 69.30 69.71 69.13 69.71 23906100
## 6 2017-05-03 64.48512 69.08 69.38 68.71 69.38 28928000
glimpse(MSFT)
## Rows: 1,259
## Columns: 7
## $ Date <chr> "2017-04-26", "2017-04-27", "2017-04-28", "2017-05-01", "2017-05-02",…
## $ Adj.Close <dbl> 63.31828, 63.72900, 63.90636, 64.79317, 64.69049, 64.48512, 64.23309,…
## $ Close <dbl> 67.83, 68.27, 68.46, 69.41, 69.30, 69.08, 68.81, 69.00, 68.94, 69.04,…
## $ High <dbl> 68.31, 68.38, 69.14, 69.55, 69.71, 69.38, 69.08, 69.03, 69.05, 69.28,…
## $ Low <dbl> 67.62, 67.58, 67.69, 68.50, 69.13, 68.71, 68.64, 68.49, 68.42, 68.68,…
## $ Open <dbl> 68.08, 68.15, 68.91, 68.68, 69.71, 69.38, 69.03, 68.90, 68.97, 68.86,…
## $ Volume <int> 26190800, 34971000, 39548800, 31954400, 23906100, 28928000, 21749400,…
The following variables are available for 1259 days: the date, the adjusted closing price, the closing and opening price, the highest and lowest price during the day and the exchange volume.
Note that the first column contains the dates which are coded as characters (without any temporal meaning). It is thus important to transform these strings of text into “real” dates with a temporal ordering. To do this the function ymd
from the lubridate
package is used (the old column Date
is substituted):
$Date = ymd(MSFT$Date)
MSFTrange(MSFT$Date)
## [1] "2017-04-26" "2022-04-25"
The variable we are interested in is the adjusted close price represented in the following plot:
%>%
MSFT ggplot() +
geom_line(aes(Date,Adj.Close))
It’s a non-stationary time series, with an increasing trend in time and period of extremely high variability.
5.2.2 Preparation of the regressors and of the response variable
For the supervised problem we will use the following regressors:
- 2/7/14/21 days moving average (see 5.1.2) computed using past
Adj.Close
values - 7 days rolling standard deviation computed using past
Adj.Close
values - previous day
Volume
The following code is used for creating in the MSFT
the new regressors’ columns:
$MA2 = zoo::rollmean(MSFT$Adj.Close,
MSFTk = 2, fill = NA, align = "right") %>%
::lag(1)
dplyr$MA7 = zoo::rollmean(MSFT$Adj.Close,
MSFTk = 7, fill = NA, align = "right") %>%
::lag(1)
dplyr$MA14 = zoo::rollmean(MSFT$Adj.Close,
MSFTk = 14, fill = NA, align = "right") %>%
::lag(1)
dplyr$MA21 = zoo::rollmean(MSFT$Adj.Close,
MSFTk = 21, fill = NA, align = "right") %>%
::lag(1)
dplyr
$SD7 = rollapply(data = MSFT$Adj.Close,
MSFTwidth = 7,
align ="right",
FUN = sd,
fill = NA) %>%
::lag(1)
dplyr
$lag1volume = dplyr::lag(MSFT$Volume,1)
MSFT
head(MSFT, 25) #first top 25 rows of the df
## Date Adj.Close Close High Low Open Volume MA2 MA7 MA14
## 1 2017-04-26 63.31828 67.83 68.31 67.62 68.08 26190800 NA NA NA
## 2 2017-04-27 63.72900 68.27 68.38 67.58 68.15 34971000 NA NA NA
## 3 2017-04-28 63.90636 68.46 69.14 67.69 68.91 39548800 63.52364 NA NA
## 4 2017-05-01 64.79317 69.41 69.55 68.50 68.68 31954400 63.81768 NA NA
## 5 2017-05-02 64.69049 69.30 69.71 69.13 69.71 23906100 64.34976 NA NA
## 6 2017-05-03 64.48512 69.08 69.38 68.71 69.38 28928000 64.74183 NA NA
## 7 2017-05-04 64.23309 68.81 69.08 68.64 69.03 21749400 64.58780 NA NA
## 8 2017-05-05 64.41046 69.00 69.03 68.49 68.90 19128800 64.35910 64.16507 NA
## 9 2017-05-08 64.35445 68.94 69.05 68.42 68.97 18566100 64.32178 64.32110 NA
## 10 2017-05-09 64.44778 69.04 69.28 68.68 68.86 22858400 64.38245 64.41045 NA
## 11 2017-05-10 64.69983 69.31 69.56 68.92 68.99 17977800 64.40112 64.48779 NA
## 12 2017-05-11 63.90636 68.46 68.73 68.12 68.36 28789400 64.57381 64.47446 NA
## 13 2017-05-12 63.83168 68.38 68.61 68.04 68.61 18714100 64.30309 64.36244 NA
## 14 2017-05-15 63.87835 68.43 68.48 67.57 68.14 31530300 63.86902 64.26909 NA
## 15 2017-05-16 65.16456 69.41 69.44 68.16 68.23 34956000 63.85502 64.21842 64.19174
## 16 2017-05-17 63.35262 67.48 69.10 67.43 68.89 30548800 64.52146 64.32614 64.32362
## 17 2017-05-18 63.56855 67.71 68.13 67.14 67.40 25201300 64.25859 64.18303 64.29674
## 18 2017-05-19 63.54977 67.69 68.10 67.43 67.50 26961100 63.46059 64.05742 64.27261
## 19 2017-05-22 64.26327 68.45 68.50 67.50 67.89 16237600 63.55916 63.89313 64.18379
## 20 2017-05-23 64.47922 68.68 68.75 68.38 68.72 15425800 63.90652 63.94412 64.15328
## 21 2017-05-24 64.56371 68.77 68.88 68.45 68.87 14593900 64.37124 64.03662 64.15286
## 22 2017-05-25 65.36172 69.62 69.88 68.91 68.97 21854100 64.52146 64.13453 64.17647
## 23 2017-05-26 65.68095 69.96 70.22 69.52 69.80 19827900 64.96272 64.16269 64.24442
## 24 2017-05-30 66.10342 70.41 70.41 69.77 69.79 17072800 65.52134 64.49531 64.33917
## 25 2017-05-31 65.56825 69.84 70.74 69.81 70.53 30436400 65.89218 64.85744 64.45743
## MA21 SD7 lag1volume
## 1 NA NA NA
## 2 NA NA 26190800
## 3 NA NA 34971000
## 4 NA NA 39548800
## 5 NA NA 31954400
## 6 NA NA 23906100
## 7 NA NA 28928000
## 8 NA 0.5403347 21749400
## 9 NA 0.3925384 19128800
## 10 NA 0.2941575 18566100
## 11 NA 0.1934688 22858400
## 12 NA 0.1708033 17977800
## 13 NA 0.2460612 28789400
## 14 NA 0.3079325 18714100
## 15 NA 0.3421353 31530300
## 16 NA 0.4965557 34956000
## 17 NA 0.6168428 30548800
## 18 NA 0.6429121 25201300
## 19 NA 0.5966703 26961100
## 20 NA 0.6130147 16237600
## 21 NA 0.6414197 15425800
## 22 64.17267 0.6651045 14593900
## 23 64.26998 0.7180622 21854100
## 24 64.36293 0.8132354 19827900
## 25 64.46755 0.8923094 17072800
The following code returns the plot containing the original time series together with two rolling mean series and the rolling sd time series:
# Plot adjusted close prices + MA + SD
%>%
MSFT ggplot() +
geom_line(aes(Date,Adj.Close)) +
geom_line(aes(Date,MA2,col="MA2")) +
geom_line(aes(Date,MA21,col="MA21"),lwd=1.2) +
geom_line(aes(Date,SD7,col="SD7"))
## Warning: Removed 2 row(s) containing missing values (geom_path).
## Warning: Removed 21 row(s) containing missing values (geom_path).
## Warning: Removed 7 row(s) containing missing values (geom_path).
The option col="..."
is not used here for setting a specific color (which will be instead chosen automatically by R
) but only for creating easily a legend with the lines names (otherwise we should reshape the data in a long format and more code is needed…). The MA2
time series is very similar to the original price series, while the MA21
series is smoother (it catches the general trend but not the fine details). The SD7
has a different range of values and shows peaks during periods of high volatility.
We finally create a new column for the response variable, represented by the next day price (see the previous simple example):
$target = lead(MSFT$Adj.Close, 1)
MSFThead(MSFT)
## Date Adj.Close Close High Low Open Volume MA2 MA7 MA14 MA21 SD7
## 1 2017-04-26 63.31828 67.83 68.31 67.62 68.08 26190800 NA NA NA NA NA
## 2 2017-04-27 63.72900 68.27 68.38 67.58 68.15 34971000 NA NA NA NA NA
## 3 2017-04-28 63.90636 68.46 69.14 67.69 68.91 39548800 63.52364 NA NA NA NA
## 4 2017-05-01 64.79317 69.41 69.55 68.50 68.68 31954400 63.81768 NA NA NA NA
## 5 2017-05-02 64.69049 69.30 69.71 69.13 69.71 23906100 64.34976 NA NA NA NA
## 6 2017-05-03 64.48512 69.08 69.38 68.71 69.38 28928000 64.74183 NA NA NA NA
## lag1volume target
## 1 NA 63.72900
## 2 26190800 63.90636
## 3 34971000 64.79317
## 4 39548800 64.69049
## 5 31954400 64.48512
## 6 23906100 64.23309
We then remove all the rows containing missing values and select only the useful columns:
= MSFT %>%
MSFT na.omit() %>%
::select(Date, MA2, MA7, MA14, MA21, SD7, lag1volume, target)
dplyr
head(MSFT)
## Date MA2 MA7 MA14 MA21 SD7 lag1volume target
## 22 2017-05-25 64.52146 64.13453 64.17647 64.17267 0.6651045 14593900 65.68095
## 23 2017-05-26 64.96272 64.16269 64.24442 64.26998 0.7180622 21854100 66.10342
## 24 2017-05-30 65.52134 64.49531 64.33917 64.36293 0.8132354 19827900 65.56825
## 25 2017-05-31 65.89218 64.85744 64.45743 64.46755 0.8923094 17072800 65.81235
## 26 2017-06-01 65.83583 65.14579 64.51946 64.50446 0.7059899 30436400 67.37083
## 27 2017-06-02 65.69030 65.36709 64.65560 64.55788 0.6209101 21603600 67.85901
The following code represents graphically the correlation between the considered variables:
library(ggcorrplot)
ggcorrplot(cor(MSFT[,-1]),
hc.order = TRUE,
type = "lower",
lab = TRUE, digits=4)
5.2.3 Training and test set
When we work with time series we want to create the training and test set without breaking the temporal sequence of the data (and the related temporal correlation). For this reason we don’t select training data randomly (as done in Section 2.2, but we select the observations from 2017 to the end of 2021 as training data and the remaining data for the test (i.e. for evaluating the predictive performance of the model). The row selection is done using the information about the year which can be extracted from the full date using the year
function from lubridate
:
= MSFT %>%
MSFTtr filter(year(Date) < 2022)
= MSFT %>%
MSFTte filter(year(Date) >= 2022)
With this approach the percentage of data dedicated to the training is the following:
nrow(MSFTtr)/nrow(MSFT)*100
## [1] 93.77526
The number of test observations is equal to 77.
5.2.4 Data normalization
By using the function normalize
we want to normalize all the columns in the MSFT
dataframe but not Date
. To implement the normalization function column by column but using one single line of code will will use mutate
combined with across
:
= MSFTtr %>% mutate(across(MA2:target,normalize))
MSFTtr = MSFTte %>% mutate(across(MA2:target,normalize)) MSFTte