We use the forecast library and the auxiliary fpp library. There are two versions of the companion library: fpp and the newer fpp3
(https://otexts.com/fpp3/). Below we use the older one, which is simpler.
library("forecast")
library("fpp")
The data contain the monthly number of visitors to Malbork Castle (the largest castle in the world measured by land area and a UNESCO World Heritage Site; https://en.wikipedia.org/wiki/Malbork_Castle).
Import the CSV file and convert it to a time series:
## Data input
z0 <- read.csv("MZM.csv", dec=".", sep=';', header=TRUE, na.strings="NA")
## What is inside MZM.csv?
str(z0)
## 'data.frame': 51 obs. of 4 variables:
## $ data : Factor w/ 51 levels "2015-01-01","2015-02-01",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ krajowi : int 3278 4149 4623 13083 68485 66199 105843 95255 28021 13532 ...
## $ zagraniczni: int 2144 1000 2128 5293 17384 18831 27809 28109 17853 8876 ...
## $ razem : num 5422 5149 6751 18376 85869 ...
## or in the RStudio Environment tab
##
is.ts(z0)
## [1] FALSE
## convert column `razem' of data frame `z0' to a time-series object
## monthly frequency, first observation = 2015/1
z <- ts(z0$razem, start=c(2015, 1), frequency=12)
## check if z is of TS type
is.ts(z)
## [1] TRUE
## check frequency/start/end
frequency(z)
## [1] 12
start(z)
## [1] 2015 1
end(z)
## [1] 2019 3
We can also combine several numeric vectors into a multivariate time-series object:
zz <- ts(matrix(c(z0$razem, z0$krajowi),
                dimnames = list(NULL, c('razem', 'krajowi')),
                ncol=2, byrow=FALSE),
         start=c(2015, 1), frequency=12)
is.ts(zz)
## [1] TRUE
## total visitors
## zz[,"razem"]
plot(z)
or with autoplot (from the ggplot2 library):
library("ggplot2")
autoplot(z) + ylab("persons")
autoplot(zz[,c("razem","krajowi")], facets=TRUE) +
ylab("")
seasonplot(z, year.labels = TRUE, main="TITLE")
## or better
## ?ggseasonplot
ggseasonplot(z) +
ylab("persons") +
ggtitle("Malbork castle visitors")
summary(z)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5149 8219 23770 50312 92062 173380
## ?window
z1 <- window(z, start=c(2016, 4))
A trend is a long-term increase or decrease in the data.
A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week.
A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency (e.g. the business cycle).
Additive model:
\[TS = T + S + E\]
or multiplicative:
\[TS = T \cdot S \cdot E\]
In the first model the seasonal variation is constant in absolute terms; in the second it is constant in relative terms.
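The components can be estimated in R; a minimal sketch using the base decompose() function (stl() is a more robust alternative):
## classical decomposition of z into trend, seasonal and random components
dec.a <- decompose(z, type="additive")
dec.m <- decompose(z, type="multiplicative")
plot(dec.a)
## or the more robust STL decomposition (additive only):
## plot(stl(z, s.window="periodic"))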
Autocorrelation measures the linear relationship between lagged values of a time series. For example, \(r_1\) measures the relationship between \(y_t\) and \(y_{t-1}\), \(r_2\) measures the relationship between \(y_t\) and \(y_{t-2}\), and so on.
ggAcf(z)
Acf(z)
Time series that show no autocorrelation are called white noise:
set.seed(30)
y <- ts(rnorm(50))
autoplot(y) + ggtitle("White noise")
ggAcf(y)
Rule of thumb: for a white noise series, we expect 95% of the spikes in the ACF to lie within \(\pm 2/\sqrt{T}\), where \(T\) is the length of the time series. These boundary lines are added to the plot by default.
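For the simulated series above the bound is easy to compute directly; ggAcf draws the same values as dashed lines:
## 95% bounds for a white-noise ACF: +/- 2/sqrt(T)
n <- length(y) ## here T = 50
c(-2, 2)/sqrt(n)
## [1] -0.2828427  0.2828427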
The residuals should be uncorrelated (otherwise they contain information that the forecasting method has not exploited).
The residuals have zero mean (no systematic error)
The residuals have constant variance
The residuals are normally distributed
res <- residuals(naive(z))
## plot(res, main='plot-title', xlab='x-axis', ylab='y-axis')
## or
autoplot(res) + xlab("Month") + ylab("") +
ggtitle("Residuals from naïve method")
## visual inspection of the residuals
## hist(res)
gghistogram(res) + ggtitle("Histogram of residuals")
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## visual inspection of the autocorrelation
## Acf(res, main='...') or
ggAcf(res) + ggtitle("ACF of residuals")
## Ljung-Box test for autocorrelation (a high p-value = no significant autocorrelation):
Box.test(res, type='Ljung-Box')
##
## Box-Ljung test
##
## data: res
## X-squared = 3.4509, df = 1, p-value = 0.06322
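The forecast package also offers checkresiduals(), which combines the time plot, ACF plot, histogram and Ljung-Box test in a single call:
## all residual diagnostics (time plot, ACF, histogram, Ljung-Box) at once
checkresiduals(naive(z))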
The accuracy of forecasts can only be determined by considering how well a model performs on new data that were not used when fitting the model.
It is common practice to separate the available data into two portions, a training set and a test set, where the training data is used to estimate the parameters of a forecasting method and the test data is used to evaluate its accuracy.
The size of the test set is typically about 20% of the total sample, although this value depends on how long the sample is and how far ahead you want to forecast. The test set should ideally be at least as large as the maximum forecast horizon required.
## training set
z1 <- window(z, end=c(2018, 12))
## test set
z2 <- window(z, start=c(2019, 1))
A forecast error is the difference between an observed value (\(y_i\)) and its forecast \(\hat y_i\), that is \(e_i = y_i - \hat y_i\). (By contrast, a residual is the difference between an observed and a fitted value, i.e. it concerns the training set.)
MSE – mean squared error (of the forecast errors)
RMSE – root mean squared error
\(p_i = e_i/y_i \cdot 100\) (percentage error)
MAPE – mean absolute percentage error, i.e. \(\operatorname{mean}(|p_i|)\)
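These measures are easy to compute by hand; a minimal sketch with hypothetical observed and forecast values (the accuracy() function used below reports them automatically):
## error measures computed by hand
obs  <- c(100, 120, 140)   ## hypothetical observed values
fcst <- c(110, 115, 150)   ## hypothetical forecasts
e    <- obs - fcst
mean(e^2)                  ## MSE
sqrt(mean(e^2))            ## RMSE
mean(abs(e/obs * 100))     ## MAPE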
divide the dataset into a training set and a test set
estimate several models
choose the model with the best fit
re-evaluate the chosen model on the test set
These simple methods are used as benchmarks. The mean method (meanf) forecasts every future value as the average of the historical data.
h <- 4 ## forecast horizon: 4 months
meanf(z1, h)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2019 52795.92 -17918.33 123510.2 -56648.61 162240.4
## Feb 2019 52795.92 -17918.33 123510.2 -56648.61 162240.4
## Mar 2019 52795.92 -17918.33 123510.2 -56648.61 162240.4
## Apr 2019 52795.92 -17918.33 123510.2 -56648.61 162240.4
Naïve method: \(\hat y_{T+h|T} = y_T\), where \(y_T\) is the last observation.
## naive(z1, h) or the equivalent
rwf(z1, h) ## random walk forecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2019 8853 -40850.13 58556.13 -67161.38 84867.38
## Feb 2019 8853 -61437.85 79143.85 -98647.57 116353.57
## Mar 2019 8853 -77235.35 94941.35 -122807.77 140513.77
## Apr 2019 8853 -90553.27 108259.27 -143175.77 160881.77
Seasonal naïve method
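Each forecast equals the last observed value from the same season, here the same month of the previous year; in the forecast package this is snaive():
## seasonal naive forecast
snaive(z1, h)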
Drift method: the last observation plus the average historical change (the drift), \(\hat y_{T+h|T} = y_T + h\,\frac{y_T - y_1}{T-1}\).
rwf(z1, h, drift=TRUE)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2019 8926 -41314.39 59166.39 -67910.05 85762.05
## Feb 2019 8999 -62803.52 80801.52 -100813.48 118811.48
## Mar 2019 9072 -79779.09 97923.09 -126814.01 144958.01
## Apr 2019 9145 -94493.01 112783.01 -149355.67 167645.67
autoplot(z1) +
autolayer(meanf(z1, h), series="Mean", PI=FALSE) +
autolayer(naive(z1, h), series="Naïve", PI=FALSE) +
autolayer(rwf(z1, h, drift=TRUE), series="Drift", PI=FALSE) +
ggtitle("Visitors in Malbork castle") +
xlab("Year") + ylab("Persons") +
guides(colour=guide_legend(title="Forecast"))
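Their test-set accuracy can be checked with accuracy(); a sketch reusing the objects above:
## test-set accuracy of the benchmark methods
accuracy(meanf(z1, h), z2)
accuracy(naive(z1, h), z2)
accuracy(snaive(z1, h), z2)
accuracy(rwf(z1, h, drift=TRUE), z2)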
The phenomenon is assumed to be linear: forecasting consists of finding the straight line that best fits the data. If a trend is present, the amplitude of the seasonal fluctuations is assumed to be constant, either in absolute terms (additive model) or in relative terms (multiplicative model).
## trend only
fit.lm <- tslm(z1 ~ trend)
## trend + seasonality
fit.lm.s <- tslm(z1 ~ trend + season)
## then
summary(fit.lm)
##
## Call:
## tslm(formula = z1 ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55965 -42019 -26632 43631 110608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40261.8 15818.6 2.545 0.0143 *
## trend 511.6 562.0 0.910 0.3674
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53940 on 46 degrees of freedom
## Multiple R-squared: 0.01769, Adjusted R-squared: -0.003661
## F-statistic: 0.8286 on 1 and 46 DF, p-value: 0.3674
## or
summary(fit.lm.s)
##
## Call:
## tslm(formula = z1 ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17166.5 -3555.1 -804.7 2969.8 17326.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1917.44 3603.43 -0.532 0.598008
## trend 431.19 71.77 6.008 0.0000007531832 ***
## season2 212.56 4717.61 0.045 0.964319
## season3 1138.86 4719.25 0.241 0.810713
## season4 18690.42 4721.98 3.958 0.000352 ***
## season5 90072.47 4725.79 19.060 < 0.0000000000000002 ***
## season6 89731.53 4730.70 18.968 < 0.0000000000000002 ***
## season7 135224.83 4736.68 28.548 < 0.0000000000000002 ***
## season8 138998.39 4743.74 29.301 < 0.0000000000000002 ***
## season9 46195.44 4751.88 9.722 0.0000000000177 ***
## season10 14487.50 4761.08 3.043 0.004424 **
## season11 -1804.44 4771.35 -0.378 0.707579
## season12 -3158.39 4782.67 -0.660 0.513332
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6671 on 35 degrees of freedom
## Multiple R-squared: 0.9886, Adjusted R-squared: 0.9847
## F-statistic: 252.3 on 12 and 35 DF, p-value: < 0.00000000000000022
## compute the forecast for the next 3 months
h3f <- forecast(fit.lm.s, h=3)
## evaluate accuracy
accuracy(h3f, z2)
## ME RMSE MAE MPE MAPE
## Training set 0.000000000000198952 5696.399 4446.594 3.461313 28.12733
## Test set -9518.749999999983629095 9739.500 9518.750 -102.567973 102.56797
## MASE ACF1 Theil's U
## Training set 0.7640013 0.564079180 NA
## Test set 1.6354850 -0.006781125 2.651043
Simple exponential smoothing (SES) is suitable for forecasting data with no clear trend or seasonal pattern.
## ses with fixed parameters:
## fit <- ses(x, alpha=a, initial='simple', h=3)
## or with alpha estimated from the data:
fit.ses <- ses(z1, h)
accuracy(fit.ses, z2)
## ME RMSE MAE MPE MAPE MASE
## Training set -923.6834 38992.710 25311.282 -43.543623 78.55966 4.3489137
## Test set 1720.7667 3366.006 2738.256 9.775192 23.66203 0.4704794
## ACF1 Theil's U
## Training set 0.24747126 NA
## Test set -0.01127277 1.007237
## Holt's linear trend method; parameters may be fixed:
## fit.holt <- holt(z1, alpha=a, beta=b, initial='simple', h=3)
## or estimated from the data:
fit.holt <- holt(z1, h=3)
accuracy(fit.holt, z2)
## ME RMSE MAE MPE MAPE MASE
## Training set -6240.598 39104.908 26056.405 -67.35502 90.51596 4.4769387
## Test set -4724.314 4745.205 4724.314 -48.73153 48.73153 0.8117185
## ACF1 Theil's U
## Training set 0.2510288 NA
## Test set -0.4769031 1.461457
## Holt-Winters seasonal method: additive and multiplicative variants
fit.hws.a <- hw(z1, seasonal='additive', h=3)
fit.hws.m <- hw(z1, seasonal='multiplicative', h=3)
accuracy(fit.hws.a, z2)
## ME RMSE MAE MPE MAPE MASE
## Training set -217.8371 5546.619 3969.623 1.881847 16.53846 0.6820496
## Test set 3209.9252 3731.442 3209.925 28.021007 28.02101 0.5515204
## ACF1 Theil's U
## Training set 0.005457713 NA
## Test set -0.094020171 1.159493
accuracy(fit.hws.m, z2)
## ME RMSE MAE MPE MAPE MASE
## Training set 2.686122 2576.974 1819.720 -1.007445 8.384407 0.3126591
## Test set 1962.967892 2417.947 1962.968 16.179856 16.179856 0.3372717
## ACF1 Theil's U
## Training set -0.047964765 NA
## Test set -0.003249461 0.7924484
## The ets() function automatically selects
## the best exponential-smoothing (ES) variant
fit.ets <- ets(z1)
summary(fit.ets)
## ETS(M,A,M)
##
## Call:
## ets(y = z1)
##
## Smoothing parameters:
## alpha = 0.0001
## beta = 0.0001
## gamma = 0.0001
##
## Initial states:
## l = 40790.0234
## b = 497.6387
## s = 0.1442 0.1586 0.4533 1.057 2.7882 2.7738
## 1.8596 1.8354 0.5064 0.1606 0.1386 0.1242
##
## sigma: 0.1112
##
## AIC AICc BIC
## 970.4035 990.8035 1002.2139
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -356.0641 3829.076 2398.424 -1.400258 6.955449 0.4120905 0.3777959
## plot(fit.ets)
## summary(fit.ets)
## checkresiduals(fit.ets)
h3e <- forecast(fit.ets, h=3)
## evaluate accuracy
accuracy(h3e, z2)
## ME RMSE MAE MPE MAPE MASE
## Training set -356.0641 3829.076 2398.424 -1.400258 6.955449 0.4120905
## Test set 1308.0935 2267.240 1814.778 8.369033 15.284337 0.3118101
## ACF1 Theil's U
## Training set 0.37779587 NA
## Test set -0.01241363 0.694391
Exponential smoothing models are based on a description of the trend and seasonality in the data, while ARIMA models aim to describe the autocorrelations in the data.
A stationary time series is one whose properties do not depend on the time at which the series is observed (informally).
Thus, time series with trends, or with seasonality, are not stationary: the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary.
A stationary time series will have no predictable patterns in the long term (informally).
Differencing means computing the differences between consecutive observations; the resulting series may be stationary.
Logarithm transformation can help to stabilise the variance of a time series. Differencing can stabilise the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality.
The differenced series is the change between consecutive observations in the original series, and can be written as
\[y'_t = y_{t} - y_{t-1}\]
When the differenced series is white noise, the model for the original series can be written as
\[y_t - y_{t-1} = ε_t\]
where \(ε_t\) denotes white noise. Rearranging this leads to the “random walk” model
\[y_{t} = y_{t-1} + ε_t\]
The forecasts from a random walk model are equal to the last observation
Second-order differencing (differencing the already-differenced series) and seasonal differencing (the difference between an observation and the corresponding observation from the previous season: \(y'_t = y_t - y_{t-m}\), where \(m\) is the seasonal period) can also be applied.
## seasonal differencing for monthly data (m = 12)
lag <- 12
lagged_x <- diff(z, lag)
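Second-order differencing is available through the differences argument of diff(); a minimal sketch:
## second-order differencing: the difference of the differenced series
diff2_z <- diff(z, differences=2)
## equivalent to diff(diff(z))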
KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test: the null hypothesis is that the data are stationary, and we look for evidence that the null hypothesis is false.
In fpp3 the corresponding function is unitroot_kpss().
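In the forecast/fpp setup used here, an analogous test is kpss.test() from the tseries package; a minimal sketch (a small p-value suggests differencing is required):
## KPSS test: H0 = the series is stationary
library("tseries")
kpss.test(z)
## and after seasonal differencing:
kpss.test(diff(z, 12))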
In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable:
\[y_{t} = c + φ_1 y_{t-1} + φ_2 y_{t-2} + \cdots + φ_p y_{t-p} + ε_t\]
where \(ε_t\) is white noise. This is like a multiple regression but with lagged values of
\(y_t\) as predictors. We refer to this as an \(AR(p)\) model, an autoregressive model of order \(p\).
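For intuition, an AR(1) process can be simulated with arima.sim() from base R; a minimal sketch with a hypothetical coefficient \(φ_1 = 0.8\):
## simulate 100 observations from an AR(1) process with phi_1 = 0.8
set.seed(1)
ar1 <- arima.sim(model=list(ar=0.8), n=100)
autoplot(ar1) + ggtitle("Simulated AR(1), phi = 0.8")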
In a moving average model, we instead use past forecast errors in a regression-like model:
\[y_{t} = c + ε_t + θ_1 ε_{t-1} + θ_2 ε_{t-2} + \cdots + θ_q ε_{t-q}\]
where \(ε_t\) is white noise. We refer to this as an \(MA(q)\) model, a moving average model of order \(q\).
\[y′_t = c + φ_1 y′_{t−1} + ⋯ + φ_p y′_{t−p} + θ_1 ε_{t−1} + ⋯ + θ_q ε_{t−q} + ε_t\]
\(y'_t\) is the differenced series; \(p\) = order of the AR part; \(d\) = degree of first differencing; \(q\) = order of the MA part.
## estimating the model in R:
## fit <- Arima(z, order=c(p, d, q))
## i.e. Arima(x, c(0,0,q)) = MA(q)
## i.e. Arima(x, c(p,0,0)) = AR(p)
##
## or auto.arima
fit.arima <- auto.arima(z1)
summary(fit.arima)
## Series: z1
## ARIMA(1,0,0)(0,1,0)[12] with drift
##
## Coefficients:
## ar1 drift
## 0.5467 422.5063
## s.e. 0.1395 171.2054
##
## sigma^2 estimated as 34891754: log likelihood=-362.85
## AIC=731.7 AICc=732.45 BIC=736.45
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 84.22554 4971.418 2962.866 -9.245848 13.1645 0.5090713 0.02048732
##
h3a <- forecast(fit.arima, h=3)
accuracy(h3a, z2)
## ME RMSE MAE MPE MAPE MASE
## Training set 84.22554 4971.418 2962.866 -9.245848 13.16450 0.5090713
## Test set -1010.31114 1426.117 1203.994 -13.008792 14.35822 0.2068670
## ACF1 Theil's U
## Training set 0.020487317 NA
## Test set -0.007395855 0.2838519
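Finally, a sketch that collects the test-set RMSE and MAPE of all models fitted above into one table (from the outputs already shown, the ARIMA model has the lowest test-set RMSE):
## collect test-set accuracy of all fitted models
cols <- c("RMSE", "MAPE")
rbind(
  tslm  = accuracy(h3f, z2)["Test set", cols],
  ses   = accuracy(fit.ses, z2)["Test set", cols],
  holt  = accuracy(fit.holt, z2)["Test set", cols],
  hw.a  = accuracy(fit.hws.a, z2)["Test set", cols],
  hw.m  = accuracy(fit.hws.m, z2)["Test set", cols],
  ets   = accuracy(h3e, z2)["Test set", cols],
  arima = accuracy(h3a, z2)["Test set", cols]
)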