Introduction

We use the forecast library and the auxiliary fpp library. There are two versions of this framework: fpp (based on forecast) and fpp3 (https://otexts.com/fpp3/). Below we use the older one, which is simpler.

library("forecast")
library("fpp")

Data input

The data contain the monthly number of visitors to Malbork Castle (the largest castle in the world measured by land area and a UNESCO World Heritage Site; https://en.wikipedia.org/wiki/Malbork_Castle).

Import CSV and convert to time-series:

## Data input
z0 <- read.csv("MZM.csv", dec=".", sep=';', header=T, na.strings="NA")

## What's inside MZM.csv?
str(z0)
## 'data.frame':    51 obs. of  4 variables:
##  $ data       : Factor w/ 51 levels "2015-01-01","2015-02-01",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ krajowi    : int  3278 4149 4623 13083 68485 66199 105843 95255 28021 13532 ...
##  $ zagraniczni: int  2144 1000 2128 5293 17384 18831 27809 28109 17853 8876 ...
##  $ razem      : num  5422 5149 6751 18376 85869 ...
## or inspect it in RStudio (Environment tab)
##
is.ts(z0)
## [1] FALSE
## convert column `razem' from frame `z0' to a time-series object
## monthly frequency, first element = 2015/1
z <- ts(z0$razem, start=c(2015, 1), frequency=12)

## check if z is of TS type
is.ts(z)
## [1] TRUE
## check frequency/start/end
frequency(z)
## [1] 12
start(z)
## [1] 2015    1
end(z)
## [1] 2019    3

We can also combine several numeric vectors into a multivariate time-series object:

zz <- ts(matrix(c(z0$razem, z0$krajowi),
                dimnames = list(NULL, c('razem', 'krajowi')),
                ncol=2, byrow=F),
         start=c(2015, 1), frequency=12)
is.ts(zz)
## [1] TRUE
## visitors total
## zz[,"razem"]

Basic plot/chart:

plot(z)

or with autoplot (from the ggplot2 library):

library("ggplot2")

autoplot(z) + ylab("persons")

autoplot(zz[,c("razem","krajowi")], facets=TRUE) + 
  ylab("")

Seasonal plot:

seasonplot(z, year.labels = TRUE, main="TITLE")

## or better
## ?ggseasonplot

ggseasonplot(z) +
  ylab("persons") +
  ggtitle("Malbork castle visitors")

Summary statistics:

summary(z)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5149    8219   23770   50312   92062  173380

Time window starting from April 2016:

# ?window
z1 <- window(z, start=c(2016, 4))

Time series decomposition

A trend is a long-term increase or decrease in the data.

A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week.

A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency (e.g., the business cycle).

Additive model:

\[TS = T + S + E\]

or (multiplicative)

\[TS = T \cdot S \cdot E\] In the first model the size of the seasonal change is constant; in the second, the rate of change is constant.
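
A quick way to estimate and inspect these components (a minimal sketch, not part of the original code; decompose() is in the base stats package):

## estimate trend, seasonal and irregular components by moving averages;
## type corresponds to the chosen model (additive or multiplicative)
dec.a <- decompose(z, type="additive")
dec.m <- decompose(z, type="multiplicative")
autoplot(dec.a)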

Autocorrelation

Autocorrelation measures the linear relationship between lagged values of a time series. For example, \(r_1\) measures the relationship between \(y_t\) and \(y_{t-1}\), \(r_2\) measures the relationship between \(y_t\) and \(y_{t-2}\), and so on.

ggAcf(z)

Acf(z)

White noise

Time series that show no autocorrelation are called white noise:

set.seed(30)
y <- ts(rnorm(50))
autoplot(y) + ggtitle("White noise")

ggAcf(y)

Rule of thumb: for a white noise series, we expect 95% of the spikes in the ACF to lie within \(\pm 2/\sqrt{T}\), where \(T\) is the length of the time series. These boundary lines are added to the plot by default.
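
For the simulated series above, this bound can be computed directly:

## 95% bound for the ACF spikes of white noise of length T = 50
2/sqrt(length(y))
## [1] 0.2828427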

Residual properties

  • The residuals should be uncorrelated (otherwise they contain information that has not been exploited, so the forecasting method is not optimal)

  • The residuals have zero mean (no systematic error)

  • The residuals have constant variance

  • The residuals are normally distributed

res <- residuals(naive(z))

## plot(res, main='chart-title', xlab='X-axis', ylab='Y-axis')
## or
autoplot(res) + xlab("Month") + ylab("") +
  ggtitle("Residuals from naïve method")

## visual inspection of residuals
## hist(res)
gghistogram(res) + ggtitle("Histogram of residuals")
## Warning: Removed 1 rows containing non-finite values (stat_bin).

## visual inspection of autocorrelation
## Acf(res, main='...') or
ggAcf(res) + ggtitle("ACF of residuals")

## Ljung-Box test for autocorrelation (a high p-value = no autocorrelation):

Box.test(res, type='Ljung-Box')
## 
##  Box-Ljung test
## 
## data:  res
## X-squared = 3.4509, df = 1, p-value = 0.06322
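
The forecast package also provides checkresiduals(), which combines the time plot, ACF, histogram and Ljung-Box test in a single call:

## one-stop residual diagnostics
## checkresiduals(naive(z))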

Evaluating forecast accuracy

The accuracy of forecasts can only be determined by considering how well a model performs on new data that were not used when fitting the model.

It is common practice to separate the available data into two portions, a training set and a test set, where the training data are used to estimate the parameters of a forecasting method and the test data are used to evaluate its accuracy.

The size of the test set is typically about 20% of the total sample, although this value depends on how long the sample is and how far ahead you want to forecast. The test set should ideally be at least as large as the maximum forecast horizon required.

## training set
z1 <- window(z, end=c(2018, 12))
## test set
z2 <- window(z, start=c(2019, 1))
  • error is the difference between the observed value \(y_i\) and the forecast \(\hat y_i\), i.e. \(e_i = y_i - \hat y_i\) (BTW a residual is the difference between an observed and a fitted value, i.e. it concerns the training set)

  • MSE – mean square error (of residuals)

  • RMSE – root mean square error

  • \(p_i = e_i/y_i \cdot 100\)

  • MAPE – mean absolute percentage error, i.e. \(\mathrm{mean}(|p_i|)\); see the sketch below
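
As a sketch of how these measures arise (hand-rolled, illustrative only; the accuracy() function used below computes them for us):

## hand-rolled RMSE and MAPE for a naive forecast of the test set
fc <- naive(z1, h=3)      # forecast the 3 test months
e <- z2 - fc$mean         # forecast errors e_i = y_i - yhat_i
sqrt(mean(e^2))           # RMSE
mean(abs(e/z2)) * 100     # MAPE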

Model evaluation

Forecasting methods

Very simple methods

Used as benchmarks:

h <- 4 ## four months ahead
meanf(z1, h)
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2019       52795.92 -17918.33 123510.2 -56648.61 162240.4
## Feb 2019       52795.92 -17918.33 123510.2 -56648.61 162240.4
## Mar 2019       52795.92 -17918.33 123510.2 -56648.61 162240.4
## Apr 2019       52795.92 -17918.33 123510.2 -56648.61 162240.4

Naïve method: \(\hat y_{T + h|T} = y_T\) (\(y_T\) is the last observation)

## naive(z1, h) or
rwf(z1, h) # equivalent alternative (random walk forecast)
##          Point Forecast     Lo 80     Hi 80      Lo 95     Hi 95
## Jan 2019           8853 -40850.13  58556.13  -67161.38  84867.38
## Feb 2019           8853 -61437.85  79143.85  -98647.57 116353.57
## Mar 2019           8853 -77235.35  94941.35 -122807.77 140513.77
## Apr 2019           8853 -90553.27 108259.27 -143175.77 160881.77
rwf(z1, h, drift=TRUE)
##          Point Forecast     Lo 80     Hi 80      Lo 95     Hi 95
## Jan 2019           8926 -41314.39  59166.39  -67910.05  85762.05
## Feb 2019           8999 -62803.52  80801.52 -100813.48 118811.48
## Mar 2019           9072 -79779.09  97923.09 -126814.01 144958.01
## Apr 2019           9145 -94493.01 112783.01 -149355.67 167645.67
autoplot(z1) +
  autolayer(meanf(z1, h), series="Mean", PI=FALSE) +
  autolayer(naive(z1, h), series="Naïve", PI=FALSE) +
  autolayer(rwf(z1, h, drift=T), series="Drift", PI=FALSE) +
  ggtitle("Visitors in Malbork castle") +
  xlab("Year") + ylab("Persons") +
  guides(colour=guide_legend(title="Forecast"))

Linear regression model

The phenomenon is assumed to be linear: forecasting consists of finding the straight line that best fits the data. If a trend is present, the amplitude of the seasonal fluctuations is assumed to be constant, either in absolute terms (additive) or in relative terms (multiplicative).

## trend
fit.lm <- tslm(z1 ~ trend)
## trend + seasonality
fit.lm.s <- tslm(z1 ~ trend + season)

## then
summary(fit.lm)
## 
## Call:
## tslm(formula = z1 ~ trend)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -55965 -42019 -26632  43631 110608 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  40261.8    15818.6   2.545   0.0143 *
## trend          511.6      562.0   0.910   0.3674  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53940 on 46 degrees of freedom
## Multiple R-squared:  0.01769,    Adjusted R-squared:  -0.003661 
## F-statistic: 0.8286 on 1 and 46 DF,  p-value: 0.3674
## or
summary(fit.lm.s)
## 
## Call:
## tslm(formula = z1 ~ trend + season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17166.5  -3555.1   -804.7   2969.8  17326.5 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  -1917.44    3603.43  -0.532             0.598008    
## trend          431.19      71.77   6.008      0.0000007531832 ***
## season2        212.56    4717.61   0.045             0.964319    
## season3       1138.86    4719.25   0.241             0.810713    
## season4      18690.42    4721.98   3.958             0.000352 ***
## season5      90072.47    4725.79  19.060 < 0.0000000000000002 ***
## season6      89731.53    4730.70  18.968 < 0.0000000000000002 ***
## season7     135224.83    4736.68  28.548 < 0.0000000000000002 ***
## season8     138998.39    4743.74  29.301 < 0.0000000000000002 ***
## season9      46195.44    4751.88   9.722      0.0000000000177 ***
## season10     14487.50    4761.08   3.043             0.004424 ** 
## season11     -1804.44    4771.35  -0.378             0.707579    
## season12     -3158.39    4782.67  -0.660             0.513332    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6671 on 35 degrees of freedom
## Multiple R-squared:  0.9886, Adjusted R-squared:  0.9847 
## F-statistic: 252.3 on 12 and 35 DF,  p-value: < 0.00000000000000022
# Compute forecast for next 3 months
h3f <- forecast(fit.lm.s, h=3)
#
# Evaluating accuracy
accuracy(h3f, z2)
##                                    ME     RMSE      MAE         MPE      MAPE
## Training set     0.000000000000198952 5696.399 4446.594    3.461313  28.12733
## Test set     -9518.749999999983629095 9739.500 9518.750 -102.567973 102.56797
##                   MASE         ACF1 Theil's U
## Training set 0.7640013  0.564079180        NA
## Test set     1.6354850 -0.006781125  2.651043

Simple exponential smoothing

This method is suitable for forecasting data with no clear trend or seasonal pattern.
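
The forecast is a weighted average of the most recent observation and the previous forecast (standard formulation):

\[\hat y_{t+1|t} = \alpha y_t + (1 - \alpha) \hat y_{t|t-1}\]

where \(0 \le \alpha \le 1\) is the smoothing parameter.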

## ses
## fit <- ses(x, alpha=a, initial='simple', h=3) or
fit.ses <- ses(z1, h)

accuracy(fit.ses, z2)
##                     ME      RMSE       MAE        MPE     MAPE      MASE
## Training set -923.6834 38992.710 25311.282 -43.543623 78.55966 4.3489137
## Test set     1720.7667  3366.006  2738.256   9.775192 23.66203 0.4704794
##                     ACF1 Theil's U
## Training set  0.24747126        NA
## Test set     -0.01127277  1.007237

Holt’s linear trend method
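
Holt extended SES to allow forecasting data with a trend; the \(h\)-step forecast is the last estimated level plus \(h\) times the last estimated slope (standard formulation):

\[\hat y_{t+h|t} = \ell_t + h b_t\]

where the level \(\ell_t\) and the slope \(b_t\) are each updated by exponential smoothing with parameters \(\alpha\) and \(\beta\).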

#fit.holt <- holt(z1, alpha=a, beta=b, initial='simple', h=3)
fit.holt <- holt(z1, h=3)

accuracy(fit.holt, z2)
##                     ME      RMSE       MAE       MPE     MAPE      MASE
## Training set -6240.598 39104.908 26056.405 -67.35502 90.51596 4.4769387
## Test set     -4724.314  4745.205  4724.314 -48.73153 48.73153 0.8117185
##                    ACF1 Theil's U
## Training set  0.2510288        NA
## Test set     -0.4769031  1.461457

Holt-Winters’ seasonal method
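
This extends Holt's method with a seasonal component. Use the additive variant when the seasonal fluctuations are roughly constant in size, and the multiplicative variant when they grow with the level of the series; the accuracy comparison below suggests the multiplicative variant fits this series better.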

fit.hws.a <- hw(z1, seasonal='additive', h=3)
fit.hws.m <- hw(z1, seasonal='multiplicative', h=3)

accuracy(fit.hws.a, z2)
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -217.8371 5546.619 3969.623  1.881847 16.53846 0.6820496
## Test set     3209.9252 3731.442 3209.925 28.021007 28.02101 0.5515204
##                      ACF1 Theil's U
## Training set  0.005457713        NA
## Test set     -0.094020171  1.159493
accuracy(fit.hws.m, z2)
##                       ME     RMSE      MAE       MPE      MAPE      MASE
## Training set    2.686122 2576.974 1819.720 -1.007445  8.384407 0.3126591
## Test set     1962.967892 2417.947 1962.968 16.179856 16.179856 0.3372717
##                      ACF1 Theil's U
## Training set -0.047964765        NA
## Test set     -0.003249461 0.7924484

ets function

## or
## a function that automatically selects
## the best exponential smoothing variant
fit.ets <- ets(z1)

summary(fit.ets)
## ETS(M,A,M) 
## 
## Call:
##  ets(y = z1) 
## 
##   Smoothing parameters:
##     alpha = 0.0001 
##     beta  = 0.0001 
##     gamma = 0.0001 
## 
##   Initial states:
##     l = 40790.0234 
##     b = 497.6387 
##     s = 0.1442 0.1586 0.4533 1.057 2.7882 2.7738
##            1.8596 1.8354 0.5064 0.1606 0.1386 0.1242
## 
##   sigma:  0.1112
## 
##       AIC      AICc       BIC 
##  970.4035  990.8035 1002.2139 
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
## Training set -356.0641 3829.076 2398.424 -1.400258 6.955449 0.4120905 0.3777959
##plot(fit)
##summary(fit)
##checkresiduals(fit)
h3e <- forecast(fit.ets, h=3)
#
# Evaluating accuracy
accuracy(h3e, z2)
##                     ME     RMSE      MAE       MPE      MAPE      MASE
## Training set -356.0641 3829.076 2398.424 -1.400258  6.955449 0.4120905
## Test set     1308.0935 2267.240 1814.778  8.369033 15.284337 0.3118101
##                     ACF1 Theil's U
## Training set  0.37779587        NA
## Test set     -0.01241363  0.694391
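
In the output above, ETS(M,A,M) denotes multiplicative errors, an additive trend and multiplicative seasonality; ets() picks the variant that minimises the AICc criterion.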

ARIMA

Stationarity

Exponential smoothing models are based on a description of the trend and seasonality in the data, while ARIMA models aim to describe the autocorrelations in the data.

A stationary time series is one whose properties do not depend on the time at which the series is observed (informal definition).

Thus, time series with trends, or with seasonality, are not stationary: the trend and seasonality affect the value of the time series at different times. On the other hand, a white noise series is stationary.

A stationary time series has no predictable patterns in the long term (informal).

Differencing

Differencing means computing the differences between consecutive observations. The resulting series may be stationary.

Logarithm transformation can help to stabilise the variance of a time series. Differencing can stabilise the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality.
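
A minimal illustration on the visitor series (transformations chosen for illustration, not part of the original code):

## stabilise the variance, then the mean
lz <- log(z)     # log transform
dlz <- diff(lz)  # first differences of the logged series
autoplot(dlz)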

The differenced series is the change between consecutive observations in the original series, and can be written as

\[y'_t = y_{t} - y_{t-1}\]

When the differenced series is white noise, the model for the original series can be written as

\[y_t - y_{t-1} = ε_t\]

where \(ε_t\) denotes white noise. Rearranging this leads to the “random walk” model

\[y_{t} = y_{t-1} + ε_t\]

The forecasts from a random walk model are equal to the last observation; this is exactly what the naive() and rwf() functions compute.

Second-order and seasonal differencing

## seasonal differencing (lag = frequency, here 12 for monthly data)
lag <- 12
lagged_x <- diff(z, lag)
## second-order differencing: diff(z, differences=2)

Unit root tests

KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test

The null hypothesis is that the data are stationary, and we look for evidence that the null hypothesis is false.

unitroot_kpss() (fpp3)
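
In the older toolchain used here, the corresponding function is kpss.test() from the tseries package (a dependency of fpp); a minimal sketch:

## KPSS test: a small p-value suggests the series is not stationary
library("tseries")
kpss.test(z)
kpss.test(diff(z, 12))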

Autoregressive models

In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable:

\[y_{t} = c + φ_1 y_{t-1} + φ_2 y_{t-2} + \cdots + φ_p y_{t-p} + ε_t\]

where \(ε_t\) is white noise. This is like a multiple regression but with lagged values of
\(y_t\) as predictors. We refer to this as an \(AR(p)\) model, an autoregressive model of order \(p\).

Moving average models

\[y_{t} = c + ε_t + θ_1 ε_{t-1} + θ_2 ε_{t-2} + \cdots + θ_q ε_{t-q}\]

where \(ε_t\) is white noise. We refer to this as an \(MA(q)\) model, a moving average model of order \(q\).

Non-seasonal ARIMA models

\[y′_t = c + φ_1 y′_{t−1} + ⋯ + φ_p y′_{t−p} + θ_1 ε_{t−1} + ⋯ + θ_q ε_{t−q} + ε_t\]

\(y_t′\) is the differenced series; \(p\) = order of the AR part; \(d\) = degree of first differencing; \(q\) = order of the MA part.

## estimating the model in R
## fit <- Arima(z, order=c(p, d, q))
## i.e. Arima(x, c(0,0,q)) = MA(q)
## i.e. Arima(x, c(p,0,0)) = AR(p)
##
## or auto.arima 
fit.arima <- auto.arima(z1)

summary(fit.arima)
## Series: z1 
## ARIMA(1,0,0)(0,1,0)[12] with drift 
## 
## Coefficients:
##          ar1     drift
##       0.5467  422.5063
## s.e.  0.1395  171.2054
## 
## sigma^2 estimated as 34891754:  log likelihood=-362.85
## AIC=731.7   AICc=732.45   BIC=736.45
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE    MAPE      MASE       ACF1
## Training set 84.22554 4971.418 2962.866 -9.245848 13.1645 0.5090713 0.02048732
## 
h3a <- forecast(fit.arima, h=3)

accuracy(h3a, z2)
##                       ME     RMSE      MAE        MPE     MAPE      MASE
## Training set    84.22554 4971.418 2962.866  -9.245848 13.16450 0.5090713
## Test set     -1010.31114 1426.117 1203.994 -13.008792 14.35822 0.2068670
##                      ACF1 Theil's U
## Training set  0.020487317        NA
## Test set     -0.007395855 0.2838519
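
The point forecasts and prediction intervals can be plotted as before (illustrative):

autoplot(h3a) +
  ggtitle("ARIMA forecast of Malbork castle visitors")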