Chapter 12 Lab 10 - 14/01/2021
In this lecture we will study the temporal structure of some time series and then estimate ARIMA models. For this lab we will use two data files: logret.csv, already introduced in Lab 9, and GNP.csv.
12.1 Study the temporal correlation of a time series
We first import the data available in the logret.csv file:
<- read.csv("files/logret.csv")
logret head(logret)
## BA CAT CSCO NKE DIS
## 1 -0.026622040 -0.012790378 0.0009849775 -0.0106301541 -0.0041285192
## 2 0.034863602 0.004669870 -0.0004920804 0.0009536113 0.0003940198
## 3 0.004289767 0.006754197 0.0068726044 0.0019048651 0.0001968674
## 4 -0.025354422 0.001156074 0.0019554610 0.0104117238 -0.0041431171
## 5 0.018325151 -0.005900466 0.0236439094 0.0016935999 0.0001974132
## 6 0.005081518 0.010930122 0.0004765842 0.0084245628 0.0098349702
## DJI
## 1 -0.004150804
## 2 0.004615400
## 3 0.006009308
## 4 0.001276720
## 5 0.001399526
## 6 0.002038986
str(logret)
## 'data.frame': 1258 obs. of 6 variables:
## $ BA : num -0.02662 0.03486 0.00429 -0.02535 0.01833 ...
## $ CAT : num -0.01279 0.00467 0.00675 0.00116 -0.0059 ...
## $ CSCO: num 0.000985 -0.000492 0.006873 0.001955 0.023644 ...
## $ NKE : num -0.01063 0.000954 0.001905 0.010412 0.001694 ...
## $ DIS : num -0.004129 0.000394 0.000197 -0.004143 0.000197 ...
## $ DJI : num -0.00415 0.00462 0.00601 0.00128 0.0014 ...
We focus on the BA log-returns, whose time series appears stationary and mean-reverting:
ts.plot(logret$BA)
The aim is to study the temporal correlation and the auto-dependence in time. We will make use of the lag.plot function, which plots a time series against lagged versions of itself. This concept is also illustrated in Figure 12.1, which refers to a simple case with \(n=6\) observations. It is important to note that when the lag \(h\) increases, the number of observations (pairs) used to compute the auto-correlation decreases. Thus, the maximum value of \(h\) that can be considered is \(n-1\). However, this is not a problem for us as we usually work with very long time series.
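As a concrete illustration of how these pairs are formed, here is a minimal sketch (assuming logret is already loaded) that builds the lag-\(h\) pairs by hand for the BA series and computes their correlation; the result is close to, although not computed exactly like, the value returned by acf later on.
y <- logret$BA
n <- length(y)
h <- 1
y_t   <- y[(h + 1):n]   # y_t for t = h+1, ..., n
y_tmh <- y[1:(n - h)]   # y_{t-h}
length(y_t)             # only n - h pairs are available
cor(y_t, y_tmh)         # roughly the lag-1 autocorrelation (acf uses a slightly different estimator)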
The following code produces the scatterplot of the pairs at distance (lag) equal to one day (i.e. \(y_t\) vs \(y_{t-1}\) for all \(t=2,3,\ldots,1258\)).
lag.plot(logret$BA, set.lags = 1) #h=1
The dashed line in the plot is the bisector. We observe that the points are spread all across the plot, denoting a weak correlation. The same plot can be obtained for \(h=2\) (plotting \(y_t\) vs \(y_{t-2}\) for all \(t=3,\ldots,1258\)), \(h=500\) and \(h=1000\).
lag.plot(logret$BA, set.lags = 2) #h=2
lag.plot(logret$BA, set.lags = 500) #h=500
lag.plot(logret$BA, set.lags = 1000) #h=1000
We note that for the considered lags the correlation is always weak. It is worth noting that the number of points is smaller for high values of \(h\). It is also possible to consider several lags jointly:
lag.plot(logret$BA, set.lags = 1:10)
All 10 plots show a very weak (auto)correlation. To quantify it we can compute the values of the auto-correlation function by using the acf function:
acf(logret$BA, plot=FALSE)
##
## Autocorrelations of series 'logret$BA', by lag
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1.000 0.033 -0.032 0.006 0.005 -0.051 -0.008 0.012 0.019 0.026 -0.046
## 11 12 13 14 15 16 17 18 19 20 21
## 0.022 0.017 -0.012 -0.008 -0.031 0.037 0.013 0.005 -0.010 0.022 0.003
## 22 23 24 25 26 27 28 29 30
## 0.025 0.015 0.048 0.032 -0.002 -0.002 -0.016 0.024 0.013
It can be observed that all the values of the auto-correlation for lags between 1 and 30 are very small. The maximum lag considered by default is \(10 \log_{10}(n)\), where \(n\) is the time series length (this can be changed with the lag.max argument).
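If a longer (or shorter) horizon is needed, the lag.max argument can be set explicitly; for instance:
acf(logret$BA, lag.max = 50, plot = FALSE)  # autocorrelations up to lag 50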
The plot of the auto-correlation function, known as correlogram, can be obtained by using the same function acf
with the option plot=TRUE
. As the latter is the default specification we will omit it.
acf(logret$BA)
All the vertical bars are contained between the blue lines. This means that there are no significant correlations: these data could have been generated by a white-noise process with uncorrelated components. Can we also conclude that the components are independent (remember that, in general, uncorrelatedness does not imply independence)? To check this we consider a transformation of the data (e.g. the square or the absolute value transformation).
acf(logret$BA^2)
acf(abs(logret$BA))
We see some significant correlations and so we conclude that we don’t have independence (if the data were independent, we would see no significant correlations in the correlogram of the transformed data).
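As a benchmark, here is a small sketch (with an arbitrarily chosen seed) that simulates a truly independent Gaussian sequence of the same length: in that case neither the series nor its squares should show significant correlations.
set.seed(1)                       # arbitrary seed, for reproducibility
iid <- rnorm(length(logret$BA))   # independent N(0,1) draws
par(mfrow = c(1, 2))
acf(iid)     # no significant bars expected
acf(iid^2)   # still none: independence is preserved by transformations
par(mfrow = c(1, 1))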
12.1.1 lag.plot for dependent sequences
Let’s see the kind of lag.plot we obtain for an auto-correlated sequence. In the following chunk we simulate a time series from an AR\((1)\) process.
sim_data <- arima.sim(n=1000, model=list(ar=0.6, ma=0))
As expected, the sequence is stationary and auto-correlated, as we can deduce from the following plots:
par(mfrow=c(1,2))
ts.plot(sim_data)
acf(sim_data)
Let’s have a look at the lag.plot for lags \(h=1,2,\dots,10\):
lag.plot(sim_data, set.lags = 1:10)
We notice how the autocorrelation decreases with the lag and essentially disappears from lag \(5\) onwards.
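For an AR(1) process with coefficient 0.6 the theoretical autocorrelation is \(\rho(h)=0.6^h\); a quick sketch comparing it with the sample values (ARMAacf belongs to the base stats package):
rho_theo <- ARMAacf(ar = 0.6, lag.max = 10)                       # theoretical ACF: 0.6^h
rho_hat  <- acf(sim_data, lag.max = 10, plot = FALSE)$acf[, 1, 1] # sample ACF (lags 0 to 10)
round(cbind(theoretical = rho_theo, sample = rho_hat), 3)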
12.2 Ljung-Box test
We consider now a different time series for the CSCO
asset.
ts.plot(logret$CSCO)
acf(logret$CSCO)
We note that all the correlations except the one at \(h=12\) are not significant (the bars lie inside the blue lines, so we do not reject the null hypothesis \(H_0:\rho(h)=0\)). Is the case of \(h=12\) a false positive or a real significant correlation? To check this we can use the simultaneous Ljung-Box test, implemented in R by the function Box.test (see ?Box.test), where we set lag=12 because we want to test \(H_0: \rho(1)=\rho(2)=\ldots=\rho(12)=0\). It is also necessary to specify the test we want with type="L" (an abbreviation of "Ljung-Box", accepted thanks to partial matching). Under \(H_0\) the test statistic is asymptotically distributed as \(\chi^2_{12}\).
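As a check of the formula, here is a minimal sketch showing how the Ljung-Box statistic \(Q = n(n+2)\sum_{h=1}^{12}\hat\rho(h)^2/(n-h)\) can be computed by hand from the sample autocorrelations; it should match, up to rounding, the X-squared value reported by Box.test below.
n   <- length(logret$CSCO)
rho <- acf(logret$CSCO, lag.max = 12, plot = FALSE)$acf[-1]   # sample autocorrelations, lag 0 dropped
Q   <- n * (n + 2) * sum(rho^2 / (n - 1:12))                  # Ljung-Box statistic
Q
1 - pchisq(Q, df = 12)                                        # corresponding p-value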
Box.test(logret$CSCO,lag=12,type="L")
##
## Box-Ljung test
##
## data: logret$CSCO
## X-squared = 17.969, df = 12, p-value = 0.1166
Given the p-value, we do not reject the null hypothesis and thus we can consider that there is no significant correlation at the first 12 lags for these data.
12.3 Looking for stationarity
We now import the GNP data. They refer to the series of the US Gross National Product (GNP) in billions of dollars (seasonally adjusted annual rate, quarterly frequency from January 1947 to July 2017).
<- read.csv("files/GNP.csv")
GNPdf str(GNPdf)
## 'data.frame': 283 obs. of 2 variables:
## $ DATE: chr "1947-01-01" "1947-04-01" "1947-07-01" "1947-10-01" ...
## $ GNP : num 244 247 251 262 268 ...
head(GNPdf)
## DATE GNP
## 1 1947-01-01 244.058
## 2 1947-04-01 247.362
## 3 1947-07-01 251.246
## 4 1947-10-01 261.545
## 5 1948-01-01 267.564
## 6 1948-04-01 274.376
tail(GNPdf)
## DATE GNP
## 278 2016-04-01 18733.02
## 279 2016-07-01 18917.47
## 280 2016-10-01 19134.46
## 281 2017-01-01 19271.96
## 282 2017-04-01 19452.41
## 283 2017-07-01 19722.06
The time series shows an increasing trend and thus it is not stationary:
ts.plot(GNPdf$GNP)
We thus compute the first-order differences by using the diff function:
ts.plot(diff(GNPdf$GNP))
The series is still non-stationary, as the mean and variance appear to change over time. We proceed by differencing once more, i.e. by applying the diff function twice:
ts.plot(diff(diff(GNPdf$GNP)))
The time series now seems to revert to zero, but the variability increases over time. This can be fixed by applying a transformation of the data that stabilizes the variance. We consider the log transformation, which belongs to the Box-Cox family of transformations. Logs are useful because they have a nice interpretation: changes in the log value correspond to relative (percentage) changes on the original scale.
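A tiny numerical illustration of this interpretation (the values are chosen arbitrarily):
x <- c(100, 105)        # a 5% increase on the original scale
diff(log(x))            # log-difference: about 0.049, i.e. roughly the 5% relative change
(x[2] - x[1]) / x[1]    # exact relative change: 0.05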
We add the transformed data to the data frame:
GNPdf$logGNP = log(GNPdf$GNP)
ts.plot(GNPdf$logGNP)
and then compute again differences:
ts.plot(diff(GNPdf$logGNP))
ts.plot(diff(diff(GNPdf$logGNP)))
The time series of the second-order differences of the log GNP values seems to be mean-reverting. We are perhaps less sure about the variability, which seems higher at the beginning of the period. In this case it is possible to use the Augmented Dickey-Fuller test to verify stationarity. The adf.test function is included in the tseries package. It tests the null hypothesis of a unit root (i.e. non-stationarity) against the alternative hypothesis of a stationary process (the alternative argument can also be set to "explosive").
logGNP2 = diff(diff(GNPdf$logGNP))
library(tseries)
adf.test(logGNP2, alternative = "stationary")
## Warning in adf.test(logGNP2, alternative = "stationary"): p-value smaller than
## printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: logGNP2
## Dickey-Fuller = -10.505, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
Given the p-value, we reject \(H_0\) and accept the stationarity hypothesis for the logGNP2 time series. We thus go on with the analysis, examining the auto-correlation function (acf) and the partial auto-correlation function (pacf):
acf(logGNP2)
pacf(logGNP2)
The temporal pattern is not clear, neither for the correlogram (is it a cut-off at lag 5?) nor for the pacf plot (is it a tail-off with a wave pattern?). We will fit different ARMA models and choose the best one according to the residual analysis and the AIC value.
12.4 Estimate ARIMA models
We will start by estimating an MA(1) model. We will use the Arima function, which is part of the forecast package. The Arima function adopts maximum likelihood for estimation; thus, the coefficients are asymptotically normally distributed. Hence, if you divide each coefficient by its standard error you get the corresponding z-statistic, and it is then possible to compute the p-value to test whether the parameter is significantly different from zero.
For estimating a model, we have to specify the data and the order \((p,d,q)\) of the model. We also include the option include.constant=TRUE.
library(forecast)
mod1 = Arima(logGNP2, order=c(0,0,1),
             include.constant = TRUE)
summary(mod1)
## Series: logGNP2
## ARIMA(0,0,1) with non-zero mean
##
## Coefficients:
## ma1 mean
## -0.4883 0e+00
## s.e. 0.0833 3e-04
##
## sigma^2 estimated as 0.0001035: log likelihood=891.38
## AIC=-1776.76 AICc=-1776.67 BIC=-1765.84
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 2.427488e-05 0.01013676 0.00741686 205.2358 351.3944 0.5454505
## ACF1
## Training set 0.05841076
The parameter estimates can be retrieved with
mod1$coef
## ma1 intercept
## -4.883294e-01 -2.360583e-05
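Since, as noted above, the estimates are asymptotically normal, the z-statistics and p-values can be computed by hand; a minimal sketch (vcov returns the estimated covariance matrix of the coefficients):
se   <- sqrt(diag(vcov(mod1)))       # standard errors of the estimates
zval <- mod1$coef / se               # z-statistics
pval <- 2 * (1 - pnorm(abs(zval)))   # two-sided p-values under asymptotic normality
round(cbind(estimate = mod1$coef, se = se, z = zval, p.value = pval), 4)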
The estimated model is given by \(Y_t = \epsilon_t - 0.488\,\epsilon_{t-1}\) (the mean is omitted as it is very small). We are interested in the analysis of the model residuals, in order to see if there is still some temporal correlation. We can use the checkresiduals function, which provides some plots of the residuals (including the correlogram) together with the Ljung-Box test output.
checkresiduals(mod1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1) with non-zero mean
## Q* = 54.359, df = 8, p-value = 5.881e-09
##
## Model df: 2. Total lags used: 10
Note that in the Ljung-Box test the number of considered lags is 10 (we are checking whether the first 10 correlations are equal to zero); if necessary this can be changed by using the lag argument. Moreover, the number of model degrees of freedom is 2 (the number of estimated parameters, \(\mu\) and \(\theta\)), so the \(\chi^2\) distribution has \(10-2=8\) degrees of freedom (the number of estimated parameters can also be set manually with the df argument). Given the p-value we reject the simultaneous uncorrelation hypothesis (this is also confirmed by the correlogram): the MA(1) model is not enough to capture the temporal structure of the data. Remember that if you want to apply the Box.test function to the model residuals yourself, you have to specify the number of estimated parameters through the fitdf argument, as in the sketch below.
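A minimal sketch of this equivalent check for mod1 (it should reproduce, up to rounding, the Q* value reported by checkresiduals above):
# 10 lags and fitdf = 2 estimated parameters, so the chi-square reference
# distribution has 10 - 2 = 8 degrees of freedom
Box.test(residuals(mod1), lag = 10, type = "Ljung-Box", fitdf = 2)
Let’s now consider the AR(1) model.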
mod2 = Arima(logGNP2, order=c(1,0,0),
             include.constant = TRUE)
summary(mod2)
## Series: logGNP2
## ARIMA(1,0,0) with non-zero mean
##
## Coefficients:
## ar1 mean
## -0.3682 0e+00
## s.e. 0.0553 5e-04
##
## sigma^2 estimated as 0.0001056: log likelihood=888.55
## AIC=-1771.09 AICc=-1771.01 BIC=-1760.18
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 2.368045e-06 0.01024175 0.007489712 198.4469 319.9212 0.5508082
## ACF1
## Training set -0.02013451
checkresiduals(mod2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0) with non-zero mean
## Q* = 36.551, df = 8, p-value = 1.392e-05
##
## Model df: 2. Total lags used: 10
The estimated model is \(Y_t = -0.368\,Y_{t-1}+\epsilon_t\).
The performance of the AR(1) is also poor, as there is still some correlation in the residuals. In any case, we can compare mod1 and mod2 by using the AIC:
mod1$aic
## [1] -1776.756
mod2$aic
## [1] -1771.092
We observe that mod1 (the one with the lower AIC) should be preferred. In the following we will try several other models: ARMA(1,1), ARMA(2,1), ARMA(2,2), ARMA(3,1), ARMA(1,3), ARMA(2,3) and ARMA(3,2).
######### ARMA(1,1) #########
mod3 = Arima(logGNP2, order=c(1,0,1),
             include.constant = TRUE)
summary(mod3)
## Series: logGNP2
## ARIMA(1,0,1) with non-zero mean
##
## Coefficients:
## ar1 ma1 mean
## 0.4161 -0.9623 0e+00
## s.e. 0.0589 0.0177 1e-04
##
## sigma^2 estimated as 9.006e-05: log likelihood=910.67
## AIC=-1813.33 AICc=-1813.19 BIC=-1798.78
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 4.807754e-06 0.009439138 0.006820991 175.5431 236.4172 0.5016291
## ACF1
## Training set -0.04671616
checkresiduals(mod3)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1) with non-zero mean
## Q* = 28.586, df = 7, p-value = 0.0001721
##
## Model df: 3. Total lags used: 10
######### ARMA(2,1) #########
mod4 = Arima(logGNP2, order=c(2,0,1),
             include.constant = TRUE)
summary(mod4)
## Series: logGNP2
## ARIMA(2,0,1) with non-zero mean
##
## Coefficients:
## ar1 ar2 ma1 mean
## 0.3780 0.1184 -0.9706 0e+00
## s.e. 0.0618 0.0614 0.0176 1e-04
##
## sigma^2 estimated as 8.919e-05: log likelihood=912.53
## AIC=-1815.06 AICc=-1814.84 BIC=-1796.87
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 2.931111e-05 0.009376612 0.006751608 207.9673 287.0736 0.4965266
## ACF1
## Training set 0.01841865
checkresiduals(mod4)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,1) with non-zero mean
## Q* = 25.384, df = 6, p-value = 0.0002898
##
## Model df: 4. Total lags used: 10
######### ARMA(2,2) #########
mod5 = Arima(logGNP2, order=c(2,0,2),
             include.constant = TRUE)
summary(mod5)
## Series: logGNP2
## ARIMA(2,0,2) with non-zero mean
##
## Coefficients:
## ar1 ar2 ma1 ma2 mean
## -0.1836 0.3745 -0.4025 -0.5486 0e+00
## s.e. 0.1762 0.0787 0.1798 0.1712 1e-04
##
## sigma^2 estimated as 8.815e-05: log likelihood=914.67
## AIC=-1817.33 AICc=-1817.02 BIC=-1795.5
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 2.827708e-05 0.009305174 0.006668397 213.8634 291.7492 0.490407
## ACF1
## Training set 0.0150511
checkresiduals(mod5)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2) with non-zero mean
## Q* = 20.823, df = 5, p-value = 0.0008749
##
## Model df: 5. Total lags used: 10
######### ARMA(3,1) #########
mod6 = Arima(logGNP2, order=c(3,0,1),
             include.constant = TRUE)
summary(mod6)
## Series: logGNP2
## ARIMA(3,0,1) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 mean
## 0.3852 0.1708 -0.1725 -0.9577 0e+00
## s.e. 0.0606 0.0630 0.0607 0.0191 1e-04
##
## sigma^2 estimated as 8.702e-05: log likelihood=916.46
## AIC=-1820.93 AICc=-1820.62 BIC=-1799.1
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1.783105e-05 0.009245019 0.006574007 186.7293 252.9048 0.4834654
## ACF1
## Training set -0.01776276
checkresiduals(mod6)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,0,1) with non-zero mean
## Q* = 17.037, df = 5, p-value = 0.00443
##
## Model df: 5. Total lags used: 10
######### ARMA(1,3) #########
mod7 = Arima(logGNP2, order=c(1,0,3),
             include.constant = TRUE)
summary(mod7)
## Series: logGNP2
## ARIMA(1,0,3) with non-zero mean
##
## Coefficients:
## ar1 ma1 ma2 ma3 mean
## 0.3332 -0.9211 0.1555 -0.1929 0e+00
## s.e. 0.1530 0.1532 0.1304 0.0652 1e-04
##
## sigma^2 estimated as 8.787e-05: log likelihood=915.13
## AIC=-1818.25 AICc=-1817.95 BIC=-1796.42
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 2.647367e-05 0.009289874 0.006652528 211.2842 281.9028 0.48924
## ACF1
## Training set 0.003535468
checkresiduals(mod7)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,3) with non-zero mean
## Q* = 18.776, df = 5, p-value = 0.002116
##
## Model df: 5. Total lags used: 10
######### ARMA(2,3) #########
mod8 = Arima(logGNP2, order=c(2,0,3),
             include.constant = TRUE)
summary(mod8)
## Series: logGNP2
## ARIMA(2,0,3) with non-zero mean
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 mean
## -0.1879 0.2820 -0.3862 -0.4622 -0.0948 0e+00
## s.e. 0.2221 0.1359 0.2277 0.2297 0.1023 1e-04
##
## sigma^2 estimated as 8.822e-05: log likelihood=915.07
## AIC=-1816.13 AICc=-1815.72 BIC=-1790.66
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 2.382653e-05 0.009291752 0.006647268 200.0394 277.0012 0.4888531
## ACF1
## Training set -0.0003539481
checkresiduals(mod8)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,3) with non-zero mean
## Q* = 19.605, df = 4, p-value = 0.0005974
##
## Model df: 6. Total lags used: 10
######### ARMA(3,2) #########
mod9 = Arima(logGNP2, order=c(3,0,2),
             include.constant = TRUE)
summary(mod9)
## Series: logGNP2
## ARIMA(3,0,2) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 mean
## 1.0208 -0.0725 -0.2551 -1.6275 0.6548 0e+00
## s.e. 0.1686 0.1105 0.0613 0.1718 0.1652 1e-04
##
## sigma^2 estimated as 8.514e-05: log likelihood=919.95
## AIC=-1825.89 AICc=-1825.48 BIC=-1800.43
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 4.601698e-05 0.009128333 0.006474332 139.5792 223.9337 0.4761351
## ACF1
## Training set -0.003316259
checkresiduals(mod9)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,0,2) with non-zero mean
## Q* = 9.666, df = 4, p-value = 0.04645
##
## Model df: 6. Total lags used: 10
all_aic = c(mod1$aic, mod2$aic, mod3$aic, mod4$aic, mod5$aic,
            mod6$aic, mod7$aic, mod8$aic, mod9$aic)
min(all_aic)
## [1] -1825.894
which.min(all_aic)
## [1] 9
The model with the lowest AIC is mod9 (ARMA(3,2)). Moreover, its residuals appear uncorrelated, meaning that no residual correlation is left. The corresponding estimated model equation is \(Y_t = 1.021\,Y_{t-1} - 0.073\,Y_{t-2} - 0.255\,Y_{t-3} + \epsilon_t - 1.628\,\epsilon_{t-1} + 0.655\,\epsilon_{t-2}\).
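The comparison above could also be organized in a short loop over the candidate orders; a hedged sketch reproducing the fits already computed:
orders <- list(c(0,0,1), c(1,0,0), c(1,0,1), c(2,0,1), c(2,0,2),
               c(3,0,1), c(1,0,3), c(2,0,3), c(3,0,2))
aics <- sapply(orders, function(ord)
  Arima(logGNP2, order = ord, include.constant = TRUE)$aic)
names(aics) <- sapply(orders, paste, collapse = ",")
sort(aics)   # the (3,0,2) specification should again come out with the lowest AIC
The auto.arima function of the forecast package automates this kind of search, although its default stepwise strategy does not necessarily explore exactly the same set of candidate models.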
Once we have chosen the best model we can plot observed and fitted values:
ts.plot(logGNP2, col="grey")
lines(mod9$fitted, col="blue")
We observe that the model is not able to capture the extreme values of the series. We now proceed with forecasting, i.e. the prediction of the time series values for the next 10 periods (here, quarters). To do this we use the forecast function:
forecast(mod9)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 282 -3.297003e-03 -0.01512236 0.008528357 -0.02138233 0.01478832
## 283 -4.933976e-04 -0.01432511 0.013338316 -0.02164718 0.02066038
## 284 -1.407692e-03 -0.01524636 0.012430972 -0.02257210 0.01975672
## 285 -5.687808e-04 -0.01471715 0.013579584 -0.02220684 0.02106928
## 286 -3.613323e-04 -0.01455580 0.013833140 -0.02206990 0.02134724
## 287 2.283446e-05 -0.01419656 0.014242231 -0.02172386 0.02176953
## 288 1.859326e-04 -0.01403349 0.014405352 -0.02156079 0.02193266
## 289 2.716382e-04 -0.01395154 0.014494819 -0.02148084 0.02202412
## 290 2.492921e-04 -0.01398453 0.014483118 -0.02151947 0.02201805
## 291 1.786564e-04 -0.01406558 0.014422894 -0.02160602 0.02196334
The output table contains the point forecasts and the limits of the prediction intervals with \(1-\alpha\) equal to 0.80 and 0.95. The corresponding plot is given by
plot(forecast(mod9))
If you are interested in obtaining more forecasts, you can use the h option:
plot(forecast(mod9, h=50))
As expected for stationary models, the \(k\)-step forecast \(\hat y_{n+k}\) converges to the estimated mean \(\hat \mu\) as \(k\rightarrow +\infty\). Moreover, the \(k\)-step forecast error variance converges to a finite value as \(k\rightarrow +\infty\), so the forecast bounds converge to parallel horizontal lines.
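A quick numerical check of this behavior (a sketch): at a long horizon the point forecasts essentially coincide with the estimated mean.
fc <- forecast(mod9, h = 200)
tail(fc$mean)             # long-horizon point forecasts
mod9$coef["intercept"]    # estimated mean: the forecasts settle on (approximately) this value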
We conclude by estimating, for the (non-stationary) log data (logGNP), the same model that we have identified for the (stationary) second-order differences (ARMA(3,2)); this corresponds to an ARIMA(3,2,2) model:
mod9original = Arima(GNPdf$logGNP, order=c(3,2,2), include.constant = TRUE)
summary(mod9original)
## Series: GNPdf$logGNP
## ARIMA(3,2,2)
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2
## 1.0211 -0.0723 -0.2566 -1.6265 0.6547
## s.e. 0.1679 0.1102 0.0614 0.1712 0.1643
##
## sigma^2 estimated as 8.515e-05: log likelihood=919.79
## AIC=-1827.59 AICc=-1827.28 BIC=-1805.76
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -0.0002598555 0.009113036 0.006455361 -0.002162477 0.08893417
## MASE ACF1
## Training set 0.389448 -0.005570206
plot(forecast(mod9original,h = 50))
The parameter estimates are essentially unchanged. We have, however, a different behavior of the prediction intervals. In fact, as expected for non-stationary models, the \(k\)-step forecast error variance diverges as \(k\rightarrow +\infty\) and the forecast bounds move further and further apart.
12.5 Dynamic regression model
In order to illustrate how to implement a dynamic regression model, also known as ARIMAX, we use the data named usconsumption
available in the fpp
library. They refer to the percentage changes in quarterly personal consumption expenditure and personal disposable income for the US, 1970 to 2010.
library(fpp)
## Loading required package: fma
##
## Attaching package: 'fma'
## The following objects are masked from 'package:faraway':
##
## airpass, eggs, ozone, wheat
## The following objects are masked from 'package:MASS':
##
## cement, housing, petrol
## Loading required package: expsmooth
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following object is masked from 'package:timeSeries':
##
## time<-
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
head(usconsumption)
## consumption income
## 1970 Q1 0.6122769 0.496540
## 1970 Q2 0.4549298 1.736460
## 1970 Q3 0.8746730 1.344881
## 1970 Q4 -0.2725144 -0.328146
## 1971 Q1 1.8921870 1.965432
## 1971 Q2 0.9133782 1.490757
tail(usconsumption)
## consumption income
## 2009 Q3 0.5740010 -1.3925562
## 2009 Q4 0.1093289 -0.1447134
## 2010 Q1 0.6710180 1.1871651
## 2010 Q2 0.7177182 1.3543547
## 2010 Q3 0.6531433 0.5611698
## 2010 Q4 0.8753521 0.3710579
class(usconsumption)
## [1] "mts" "ts"
Note that the object usconsumption is a multivariate time series (classes mts and ts) rather than a data frame. For this reason, in order to select a variable (i.e. a column) it is not possible to use $; we have to use square brackets instead. Below we represent the two variables graphically:
#consumption
ts.plot(usconsumption[,1])
#income
ts.plot(usconsumption[,2])
#both the variables together
ts.plot(usconsumption[,1],ylim=range(usconsumption))
lines(usconsumption[,2],col="red")
Both the time series appear to be stationary.
We start by implementing a standard simple regression model where consumption
is the response variable and income
the regressor:
modreg = lm(consumption ~ income, data=usconsumption)
summary(modreg)
##
## Call:
## lm(formula = consumption ~ income, data = usconsumption)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3681 -0.3237 0.0266 0.3436 1.5581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.52062 0.06231 8.356 2.79e-14 ***
## income 0.31866 0.05226 6.098 7.61e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6274 on 162 degrees of freedom
## Multiple R-squared: 0.1867, Adjusted R-squared: 0.1817
## F-statistic: 37.18 on 1 and 162 DF, p-value: 7.614e-09
The slope \(\beta\) is positive and significantly different from zero. We can now check the uncorrelatedness assumption for the regression residuals:
acf(modreg$residuals)
pacf(modreg$residuals)
From the acf and pacf plots we observe that the residuals are correlated; thus the uncorrelatedness assumption is not valid. The temporal pattern in the two plots is not clear: is it a cut-off at lag 3 for the acf? And a cut-off at lag 2 for the pacf? It is thus necessary to try different ARMA models; it turns out that the best one is the ARMA(1,2). Note that to include a regressor in an ARMA model we use the xreg argument of the Arima function:
modarimax = Arima(usconsumption[,1], order=c(1,0,2),
                  xreg = usconsumption[,2], include.constant = TRUE)
summary(modarimax)
## Series: usconsumption[, 1]
## Regression with ARIMA(1,0,2) errors
##
## Coefficients:
## ar1 ma1 ma2 intercept xreg
## 0.6516 -0.5440 0.2187 0.5750 0.2420
## s.e. 0.1468 0.1576 0.0790 0.0951 0.0513
##
## sigma^2 estimated as 0.3502: log likelihood=-144.27
## AIC=300.54 AICc=301.08 BIC=319.14
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.001835782 0.5827238 0.4375789 -Inf Inf 0.6298752 0.000656846
modarimax$coef
## ar1 ma1 ma2 intercept xreg
## 0.6516382 -0.5440223 0.2186652 0.5749932 0.2419798
The estimated model is given by the following two equations:
\(Y_t = 0.57 + 0.24\,x_t + u_t\)
\(u_t = 0.65\,u_{t-1} + \epsilon_t - 0.54\,\epsilon_{t-1} + 0.22\,\epsilon_{t-2}\)
where \(u_t\) are the (correlated) regression errors and \(\epsilon_t\) are the ARMA errors, which should behave as realizations of a white-noise process. We thus check whether the ARMA errors are uncorrelated:
checkresiduals(modarimax, lag=14)
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(1,0,2) errors
## Q* = 9.7082, df = 9, p-value = 0.3746
##
## Model df: 5. Total lags used: 14
We are interested in testing whether the autocorrelation at lag \(h=14\) is a false positive or a real correlation (we thus set lag=14). Given the p-value, we conclude that the ARMA residuals are uncorrelated.