8.2 ARMA processes
Since the seminal work of George E. P. Box and Jenkins (1976), autoregressive moving average (ARMA) models have become ubiquitous in time series analysis. Thus, we present a brief introduction to these models in this section.
Let’s start with the linear Gaussian model with autoregressive errors:
\begin{align} y_t &= \boldsymbol{x}_t^{\top} \boldsymbol{\beta} + \mu_t \tag{8.5} \\ \phi(L) \mu_t &= \epsilon_t \tag{8.6} \end{align}
where \boldsymbol{x}_t is a K-dimensional vector of regressors, \epsilon_t \sim \text{iid} \, N(0, \sigma^2), and \phi(L) = 1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p is a polynomial in the lag operator L, where L z_t = z_{t-1}, and in general, L^r z_t = z_{t-r}.
Thus, we see that the stochastic error \mu_t follows an autoregressive process of order p, i.e., \mu_t \sim AR(p). It is standard practice to assume that \mu_t is second-order stationary, meaning that its mean and variance are finite and do not depend on t, and that the autocovariance \mathbb{E}[\mu_t \mu_s] depends only on |t - s|. Second-order stationarity requires all roots of \phi(L) to lie outside the unit circle. For instance, for an AR(1) process, 1 - \phi_1 L = 0 implies L = 1/\phi_1, so |\phi_1| < 1 is required for the process to be second-order stationary.
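As a quick illustration of this condition, the following sketch checks the stationarity of an AR(2) polynomial by computing the roots of \phi(L) numerically; the coefficient values are arbitrary and only for illustration.
# Check second-order stationarity of an AR(p) process by verifying that all
# roots of 1 - phi_1 L - ... - phi_p L^p lie outside the unit circle
phi <- c(0.5, 0.3) # phi_1 and phi_2 (illustrative values)
roots <- polyroot(c(1, -phi)) # roots of 1 - 0.5L - 0.3L^2
all(Mod(roots) > 1) # TRUE: the AR(2) process is second-order stationary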
The likelihood function conditional on the first p observations is:
\begin{align*} p(y_{p+1}, \dots, y_T \mid y_{p}, \dots, y_1, \boldsymbol{\theta}) &= \prod_{t=p+1}^{T} p(y_t \mid \mathcal{I}_{t-1}, \boldsymbol{\theta}) \\ &\propto \sigma^{-(T-p)} \exp\left\{-\frac{1}{2\sigma^2} \sum_{t=p+1}^T \left(y_t - \hat{y}_{t \mid t-1, \boldsymbol{\theta}}\right)^2 \right\} \end{align*}
where \mathcal{I}_{t-1} is the past information, \boldsymbol{\theta} collects all parameters (\boldsymbol{\beta}, \phi_1, \dots, \phi_p, \sigma^2), and \hat{y}_{t \mid t-1, \boldsymbol{\theta}} = (1 - \phi(L)) y_t + \phi(L) \boldsymbol{x}_t^{\top} \boldsymbol{\beta}.
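For example, with p = 1 the one-step-ahead predictor reduces to \begin{align*} \hat{y}_{t \mid t-1, \boldsymbol{\theta}} = \phi_1 y_{t-1} + \boldsymbol{x}_t^{\top}\boldsymbol{\beta} - \phi_1 \boldsymbol{x}_{t-1}^{\top}\boldsymbol{\beta} = \boldsymbol{x}_t^{\top}\boldsymbol{\beta} + \phi_1\left(y_{t-1} - \boldsymbol{x}_{t-1}^{\top}\boldsymbol{\beta}\right), \end{align*} that is, the regression prediction plus \phi_1 times the previous regression error.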
Multiplying Equation (8.5) by \phi(L), we can express the model as \begin{align} y_t^*=\boldsymbol{x}_t^{*\top}\boldsymbol{\beta}+\epsilon_t \tag{8.7} \end{align} where y_t^*=\phi(L)y_t and \boldsymbol{x}_t^{*}=\phi(L)\boldsymbol{x}_t.
Thus, collecting all observations t=p+1,p+2,\dots,T, we have \boldsymbol{y}^*=\boldsymbol{X}^*\boldsymbol{\beta}+\boldsymbol{\epsilon}, where \boldsymbol{\epsilon}\sim N(\boldsymbol{0},\sigma^2\boldsymbol{I}_{T-p}), \boldsymbol{y}^* is a T-p dimensional vector, and \boldsymbol{X}^* is a (T-p)\times K dimensional matrix.
Assume that \boldsymbol{\beta}\mid \sigma\sim N(\boldsymbol{\beta}_0,\sigma^2\boldsymbol{B}_0), \sigma^2\sim IG(\alpha_0/2,\delta_0/2) and \boldsymbol{\phi}\sim N(\boldsymbol{\phi}_0,\boldsymbol{\Phi}_0)\mathbb{1}(\boldsymbol{\phi}\in S_{\boldsymbol{\phi}}), where S_{\boldsymbol{\phi}} is the stationary region of \boldsymbol{\phi}=[\phi_1 \ \dots \ \phi_p]^{\top}. Then, Equation (8.7) implies that \boldsymbol{\beta}\mid \sigma^2,\boldsymbol{\phi},\boldsymbol{y},\boldsymbol{X}\sim N(\boldsymbol{\beta}_n, \sigma^2{\boldsymbol{B}}_n), where \boldsymbol{B}_n = (\boldsymbol{B}_0^{-1} + \boldsymbol{X}^{*\top}\boldsymbol{X}^{*})^{-1} and \boldsymbol{\beta}_n = \boldsymbol{B}_n(\boldsymbol{B}_0^{-1}\boldsymbol{\beta}_0 + \boldsymbol{X}^{*\top}\boldsymbol{y}^{*}). In addition, \sigma^2\mid \boldsymbol{\beta},\boldsymbol{\phi},\boldsymbol{y},\boldsymbol{X}\sim IG(\alpha_n/2,\delta_n/2), where \alpha_n=\alpha_0+T-p and \delta_n=\delta_0+(\boldsymbol{y}^*-\boldsymbol{X}^{*}\boldsymbol{\beta})^{\top}(\boldsymbol{y}^*-\boldsymbol{X}^{*}\boldsymbol{\beta})+(\boldsymbol{\beta}-\boldsymbol{\beta}_0)^{\top}\boldsymbol{B}_0^{-1}(\boldsymbol{\beta}-\boldsymbol{\beta}_0). These conditional posterior distributions imply that we can use a Gibbs sampling algorithm to perform inference on these parameters (Siddhartha Chib 1993).
We know from Equation (8.5) that \mu_t=y_t-\boldsymbol{x}_t^{\top}\boldsymbol{\beta}, and from Equation (8.6) that \mu_t=\phi_1\mu_{t-1}+\dots+\phi_p\mu_{t-p}+\epsilon_t, t=p+1,\dots,T. In matrix notation, \boldsymbol{\mu}=\boldsymbol{U}\boldsymbol{\phi}+\boldsymbol{\epsilon}, where \boldsymbol{\mu} is a T-p dimensional vector and \boldsymbol{U} is a (T-p)\times p matrix whose t-th row is [\mu_{t-1} \ \dots \ \mu_{t-p}]. Thus, the posterior distribution of \boldsymbol{\phi}\mid \boldsymbol{\beta},\sigma^2,\boldsymbol{y},\boldsymbol{X} is N(\boldsymbol{\phi}_n, \boldsymbol{\Phi}_n)\mathbb{1}(\boldsymbol{\phi}\in S_{\boldsymbol{\phi}}), where \boldsymbol{\Phi}_n=(\boldsymbol{\Phi}_0^{-1}+\sigma^{-2}\boldsymbol{U}^{\top}\boldsymbol{U})^{-1} and \boldsymbol{\phi}_n=\boldsymbol{\Phi}_n(\boldsymbol{\Phi}_0^{-1}\boldsymbol{\phi}_0+\sigma^{-2}\boldsymbol{U}^{\top}\boldsymbol{\mu}) (see Exercise 4).
Drawing from this distribution under the stationarity restriction is straightforward: we simply sample from the unrestricted multivariate normal distribution and discard draws that do not satisfy the stationarity condition. The proportion of draws that meet this restriction estimates the conditional probability that the process is stationary.
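A minimal sketch of this rejection step for a general AR(p) is shown below; it assumes that the conditional posterior mean phin and covariance Phin have already been computed as above, and the stationarity check reuses the roots of \phi(L).
# Draw phi | beta, sigma2, y, X under the stationarity restriction by rejection
# (phin and Phin are the conditional posterior mean and covariance from above)
DrawPhi <- function(phin, Phin){
repeat{
phi <- MASS::mvrnorm(1, mu = phin, Sigma = Phin)
if(all(Mod(polyroot(c(1, -phi))) > 1)) break # keep only stationary draws
}
return(phi)
}
The acceptance rate of this loop estimates the conditional probability that the process is stationary.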
Example: Effects of inflation on interest rate II
A dynamic linear model can be specified for the example of the effects of inflation on interest rates to account for a potential dynamic relationship. However, we can also introduce dynamics by assuming \Delta i_t = \beta_{1} + \beta_{2} \Delta inf_t + \beta_{3} \Delta def_t + \mu_t, where \mu_t = \phi_1 \mu_{t-1} + \epsilon_t. This leads to the model \Delta i_t = \beta_{1}(1-\phi_1) + \phi_1 \Delta i_{t-1} + \beta_{2}(\Delta inf_t - \phi_1 \Delta inf_{t-1}) + \beta_{3}(\Delta def_t - \phi_1 \Delta def_{t-1}) + \epsilon_t. Thus, we again use the dataset 16INTDEF.csv provided by Jeffrey M. Wooldridge (2016), now to illustrate linear regression with AR(1) errors.
The following code demonstrates how to implement this application using vague priors, assuming \alpha_0 = \delta_0 = 0.01, \boldsymbol{\beta}_0 = \boldsymbol{0}, \boldsymbol{B}_0 = \boldsymbol{I}, \boldsymbol{\phi}_0 = \boldsymbol{0}, and \boldsymbol{\Phi}_0 = \boldsymbol{I}. We use 15,000 MCMC iterations after a burn-in of 5,000, without thinning (thinning parameter equal to 1).
rm(list = ls())
set.seed(010101)
DataIntRate <- read.csv("https://raw.githubusercontent.com/besmarter/BSTApp/refs/heads/master/DataApp/16INTDEF.csv", sep = ",", header = TRUE, quote = "")
attach(DataIntRate)
yt <- diff(i3); ytlag <- dplyr::lag(yt, n = 1)
T <- length(yt)
Xt <- cbind(diff(inf), diff(def)); Xtlag <- dplyr::lag(Xt, n = 1)
K <- dim(Xt)[2] + 1
# Initial OLS fit: used only to obtain starting values for the Gibbs sampler
Reg <- lm(yt ~ ytlag + I(Xt[,-1] - Xtlag))
SumReg <- summary(Reg); SumReg
##
## Call:
## lm(formula = yt ~ ytlag + I(Xt[, -1] - Xtlag))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.85560 -0.86022 0.04917 0.95966 2.85684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.07375 0.19570 0.377 0.7081
## ytlag 0.13495 0.15530 0.869 0.3897
## I(Xt[, -1] - Xtlag)1 -0.13756 0.07594 -1.811 0.0771 .
## I(Xt[, -1] - Xtlag)2 -0.14807 0.09212 -1.607 0.1153
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.331 on 43 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1506, Adjusted R-squared: 0.09137
## F-statistic: 2.542 on 3 and 43 DF, p-value: 0.06878
# Conditional posterior draw of sigma^2 given Beta and Phi
PostSig2 <- function(Beta, Phi){
Xstar<- matrix(NA, T-1, K - 1)
ystar <- matrix(NA, T-1, 1)
for(t in 2:T){
Xstar[t-1,] <- Xt[t,] - Phi*Xt[t-1,]
ystar[t-1,] <- yt[t] - Phi*yt[t-1]
}
Xstar <- cbind(1, Xstar)
an <- T - 1 + a0
dn <- d0 + t(ystar - Xstar%*%Beta)%*%(ystar - Xstar%*%Beta) + t(Beta - b0)%*%B0i%*%(Beta - b0)
sig2 <- invgamma::rinvgamma(1, shape = an/2, rate = dn/2)
return(sig2)
}
# Conditional posterior draw of Beta given sigma^2 and Phi
PostBeta <- function(sig2, Phi){
Xstar<- matrix(NA, T-1, K - 1)
ystar <- matrix(NA, T-1, 1)
for(t in 2:T){
Xstar[t-1,] <- Xt[t,] - Phi*Xt[t-1,]
ystar[t-1,] <- yt[t] - Phi*yt[t-1]
}
Xstar <- cbind(1, Xstar)
XtXstar <- t(Xstar)%*%Xstar
Xtystar <- t(Xstar)%*%ystar
Bn <- solve(B0i + XtXstar)
bn <- Bn%*%(B0i%*%b0 + Xtystar)
Beta <- MASS::mvrnorm(1, bn, sig2*Bn)
return(Beta)
}
# Conditional posterior draw of Phi given sigma^2 and Beta, truncated to the stationary region |Phi| < 1
PostPhi <- function(sig2, Beta){
u <- yt - cbind(1,Xt)%*%Beta
U <- u[-T]
ustar <- u[-1]
UtU <- t(U)%*%U
Utu <- t(U)%*%ustar
Phin <- solve(Phi0i + sig2^(-1)*UtU)
phin <- Phin%*%(Phi0i%*%phi0 + sig2^(-1)*Utu)
Phi <- truncnorm::rtruncnorm(1, a = -1, b = 1, mean = phin, sd = Phin^0.5)
return(Phi)
}
# Hyperparameters
d0 <- 0.01; a0 <- 0.01
b0 <- rep(0, K); c0 <- 1;
B0 <- c0*diag(K); B0i <- solve(B0)
phi0 <- 0; Phi0 <- 1; Phi0i <- 1/Phi0
# MCMC parameters
mcmc <- 15000
burnin <- 5000
tot <- mcmc + burnin
thin <- 1
PostBetas <- matrix(0, mcmc+burnin, K)
PostSigma2s <- rep(0, mcmc+burnin)
PostPhis <- rep(0, mcmc+burnin)
Beta <- rep(0, K); Phi <- 0
sig2 <- SumReg$sigma^2; Phi <- SumReg$coefficients[2,1]
Beta <- SumReg$coefficients[c(1,3,4),1]
pb <- winProgressBar(title = "progress bar", min = 0, max = tot, width = 300) # winProgressBar is Windows-only; use txtProgressBar on other platforms
for(s in 1:tot){
sig2 <- PostSig2(Beta = Beta, Phi = Phi)
PostSigma2s[s] <- sig2
Beta <- PostBeta(sig2 = sig2, Phi = Phi)
PostBetas[s,] <- Beta
Phi <- PostPhi(sig2 = sig2, Beta = Beta)
PostPhis[s] <- Phi
setWinProgressBar(pb, s, title=paste( round(s/tot*100, 0), "% done"))
}
close(pb)
## NULL
keep <- seq((burnin+1), tot, thin)
PosteriorBetas <- coda::mcmc(PostBetas[keep,])
summary(PosteriorBetas)
##
## Iterations = 1:15000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 15000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## [1,] 0.0663 0.1904 0.0015545 0.001554
## [2,] 0.2293 0.1171 0.0009563 0.001163
## [3,] -0.1451 0.1704 0.0013913 0.001391
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## var1 -0.307875 -0.05914 0.06597 0.19276 0.4392
## var2 -0.004462 0.15191 0.22912 0.30839 0.4564
## var3 -0.475203 -0.25696 -0.14711 -0.03156 0.1918
summary(coda::mcmc(PostSigma2s[keep])) # posterior summary of sigma^2
##
## Iterations = 1:15000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 15000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 1.698250 0.385326 0.003146 0.003462
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## 1.110 1.423 1.643 1.909 2.611
summary(coda::mcmc(PostPhis[keep])) # posterior summary of phi
##
## Iterations = 1:15000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 15000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 0.020208 0.163454 0.001335 0.001705
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## -0.28977 -0.09105 0.01602 0.12843 0.34971
dfBinf <- as.data.frame(PosteriorBetas[,2])
# Basic density
library(ggplot2)
p <- ggplot(dfBinf, aes(x=var1)) +
geom_density(color="darkblue", fill="lightblue") +
geom_vline(aes(xintercept=mean(var1)), color="blue", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=quantile(var1, 0.025)), color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=quantile(var1, 0.975)), color="red", linetype="dashed", linewidth=1) +
labs(title="Density effect of inflation on interest rate", x="Effect of inflation", y = "Density")
p
This figure shows the posterior density of the effect of inflation on the interest rate. The posterior mean of this coefficient is approximately 0.23, and the 95% credible interval is (0.00, 0.46), which again indicates that annual changes in the interest rate are weakly positively related to annual changes in inflation.
Observe that the previous setting encompasses the particularly relevant case y_t\sim AR(p): we simply omit the covariates so that y_t=\mu_t. Siddhartha Chib and Greenberg (1994) extend Bayesian inference for linear regression with AR(p) errors to ARMA(p,q) errors using a state-space representation.
Setting y_t=\mu_t such that y_t=\sum_{s=1}^{p}\phi_s y_{t-s}+\sum_{s=1}^{q}\theta_s \epsilon_{t-s}+\epsilon_t, letting r=\max \left\{p,q+1\right\}, \phi_s=0 for s>p and \theta_s=0 for s>q, and defining the r-dimensional vectors \boldsymbol{x}^{\top}=[1 \ 0 \ \dots \ 0] and \boldsymbol{H}=[1 \ \theta_1 \ \dots \ \theta_{r-1}]^{\top}, together with the r\times r matrix \begin{align*} \boldsymbol{G}=\begin{bmatrix} \phi_1 & 1 & 0 & \dots & 0\\ \phi_2 & 0 & 1 & \dots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \phi_{r-1} & 0 & 0 & \dots & 1\\ \phi_r & 0 & 0 & \dots & 0\\ \end{bmatrix} = \left[\begin{array}{c|c} \boldsymbol{\phi}_{1:r-1} & \boldsymbol{I}_{r-1}\\ \hline \phi_r & \boldsymbol{0}^{\top} \end{array}\right], \end{align*} and given the state vector \boldsymbol{\beta}_t=[\beta_{1,t} \ \beta_{2,t} \ \dots \ \beta_{r,t}]^{\top}, the ARMA model has the following state-space representation: \begin{align*} y_t&=\boldsymbol{x}^{\top}\boldsymbol{\beta}_t\\ \boldsymbol{\beta}_t &= \boldsymbol{G}\boldsymbol{\beta}_{t-1}+\boldsymbol{H}\epsilon_{t}. \end{align*}
This is a dynamic linear model where \boldsymbol{\Sigma}_t=0, and \boldsymbol{\Omega}_t=\sigma^2\boldsymbol{H}\boldsymbol{H}^{\top} (see Petris, Petrone, and Campagnoli (2009) and Siddhartha Chib and Greenberg (1994)).
A notable advantage of the state-space representation of the ARMA model is that the likelihood can be evaluated efficiently using the Kalman filter recursions. Extensions to autoregressive integrated moving average ARIMA(p,d,q) models are discussed in Petris, Petrone, and Campagnoli (2009). In ARIMA(p,d,q) models, d refers to the order of integration (or differencing) required to eliminate the stochastic trend in a time series (see Enders (2014) for details).
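As a brief illustration of this point, the following sketch uses the dlm package of Petris, Petrone, and Campagnoli (2009) to build the state-space form of an ARMA(1,1) process and evaluate its log-likelihood with the Kalman filter; the parameter values are illustrative, and dlmModARMA and dlmLL are functions from that package, not objects defined in this chapter.
# State-space form of an ARMA(1,1) and likelihood evaluation via the Kalman filter
library(dlm)
set.seed(010101)
y <- arima.sim(n = 200, model = list(ar = 0.5, ma = 0.4), sd = 0.5)
mod <- dlmModARMA(ar = 0.5, ma = 0.4, sigma2 = 0.25) # builds x, G and H as above
-dlmLL(y, mod) # dlmLL returns the negative log-likelihood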
Example: AR(2) process
Let’s see the state-space representation of a stationary AR(2) process with intercept, that is, y_t=\mu+\phi_1y_{t-1}+\phi_2y_{t-2}+\epsilon_t, where \epsilon_t\sim N(0,\sigma^2). Thus, \mathbb{E}[y_t]=\frac{\mu}{1-\phi_1-\phi_2}, and variance Var[y_t]=\frac{\sigma^2(1-\phi_2)}{1-\phi_2-\phi_1^2-\phi_1^2\phi_2-\phi_2^2+\phi_2^3}.
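For instance, with the values used in the simulation below (\mu=0.5, \phi_1=0.5, \phi_2=0.3, \sigma=0.5), \mathbb{E}[y_t]=0.5/(1-0.5-0.3)=2.5 and Var[y_t]=0.25(1-0.3)/0.312\approx 0.56, that is, a standard deviation of approximately 0.75, which is consistent with the sample moments reported below.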
In addition, we can prove that setting z_t=y_t-\bar{\mu}, where \bar{\mu}=\mathbb{E}[y_t], we have z_t=\phi_1z_{t-1}+\phi_2z_{t-2}+\epsilon_t with \mathbb{E}[z_t]=0, and that these are equivalent representations (see Exercise 5). Then, setting \boldsymbol{x}^{\top}=[1 \ 0], \boldsymbol{H}=[1 \ 0]^{\top}, \boldsymbol{G}=\begin{bmatrix} \phi_1 & 1\\ \phi_2 & 0 \\ \end{bmatrix}, \boldsymbol{\beta}_t=[\beta_{t1} \ \beta_{t2}]^{\top}, \boldsymbol{\Sigma}_t=0 and \boldsymbol{\Omega}_t=\sigma^2\boldsymbol{H}\boldsymbol{H}^{\top}, we have \begin{align*} z_t&=\boldsymbol{x}^{\top}\boldsymbol{\beta}_t& \text{(Observation equation)}\\ \boldsymbol{\beta}_t&=\boldsymbol{G}\boldsymbol{\beta}_{t-1}+\boldsymbol{H}{\epsilon}_t & \text{(State equation)}. \end{align*}
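As a quick check with the parameter values used in the simulation below, the process is stationary if and only if the eigenvalues of \boldsymbol{G} lie inside the unit circle, which is equivalent to the roots of \phi(L) lying outside it:
# Stationarity check of the AR(2) via the transition matrix G:
# the eigenvalues of G are the inverse roots of phi(L)
phi1 <- 0.5; phi2 <- 0.3
G <- matrix(c(phi1, phi2, 1, 0), 2, 2)
Mod(eigen(G)$values) # both moduli smaller than 1: stationary
1/Mod(polyroot(c(1, -phi1, -phi2))) # the same values: inverse roots of phi(L)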
We use the function stan_sarima from the package bayesforecast to perform Bayesian inference in ARMA models in our GUI. The following code shows how to simulate an AR(2) process and perform Bayesian inference using this function. We run 10,000 MCMC iterations after a burn-in of 5,000, assuming \sigma^2\sim IG(0.01/2, 0.01/2), \mu\sim N(0, 1) and \phi_k\sim N(0, 1), k=1,2. The trace plots look good, and all 95% credible intervals encompass the population values.
rm(list = ls()); set.seed(010101)
T <- 200; mu <- 0.5
phi1 <- 0.5; phi2 <- 0.3; sig <- 0.5
Ey <- mu/(1-phi1-phi2); Sigy <- sig*((1-phi2)/(1-phi2-phi1^2-phi2*phi1^2-phi2^2+phi2^3))^0.5
y <- rnorm(T, mean = Ey, sd = Sigy)
e <- rnorm(T, mean = 0, sd = sig)
for(t in 3:T){
y[t] <- mu + phi1*y[t-1] + phi2*y[t-2] + e[t]
}
mean(y); sd(y)
## [1] 2.642552
## [1] 0.7734486
# MCMC settings (the workspace was cleared above, so we define them here)
mcmc <- 10000; burnin <- 5000; thin <- 1; tot <- mcmc + burnin
library(bayesforecast) # required so that the normal() and inverse.gamma() prior constructors are available
sf1 <- bayesforecast::stan_sarima(y, order = c(2, 0, 0), prior_mu0 = normal(0, 1),
prior_ar = normal(0, 1), prior_sigma0 = inverse.gamma(0.01/2, 0.01/2),
seasonal = c(0, 0, 0), iter = tot, warmup = burnin, chains = 1)
##
## SAMPLING FOR MODEL 'Sarima' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 9.4e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.94 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 15000 [ 0%] (Warmup)
## Chain 1: Iteration: 1500 / 15000 [ 10%] (Warmup)
## Chain 1: Iteration: 3000 / 15000 [ 20%] (Warmup)
## Chain 1: Iteration: 4500 / 15000 [ 30%] (Warmup)
## Chain 1: Iteration: 5001 / 15000 [ 33%] (Sampling)
## Chain 1: Iteration: 6500 / 15000 [ 43%] (Sampling)
## Chain 1: Iteration: 8000 / 15000 [ 53%] (Sampling)
## Chain 1: Iteration: 9500 / 15000 [ 63%] (Sampling)
## Chain 1: Iteration: 11000 / 15000 [ 73%] (Sampling)
## Chain 1: Iteration: 12500 / 15000 [ 83%] (Sampling)
## Chain 1: Iteration: 14000 / 15000 [ 93%] (Sampling)
## Chain 1: Iteration: 15000 / 15000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 5.066 seconds (Warm-up)
## Chain 1: 10.572 seconds (Sampling)
## Chain 1: 15.638 seconds (Total)
## Chain 1:
keep <- seq(burnin+1, tot, thin)
Postmu <- sf1[["stanfit"]]@sim[["samples"]][[1]][["mu0"]][keep]
Postsig <- sf1[["stanfit"]]@sim[["samples"]][[1]][["sigma0"]][keep]
Postphi1 <- sf1[["stanfit"]]@sim[["samples"]][[1]][["ar0[1]"]][keep]
Postphi2 <- sf1[["stanfit"]]@sim[["samples"]][[1]][["ar0[2]"]][keep]
Postdraws <- coda::mcmc(cbind(Postmu, Postsig, Postphi1, Postphi2))
summary(Postdraws)
##
## Iterations = 1:10000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## Postmu 0.6628 0.13547 0.0013547 0.0017646
## Postsig 0.5260 0.02700 0.0002700 0.0003311
## Postphi1 0.5624 0.06932 0.0006932 0.0009865
## Postphi2 0.1916 0.06782 0.0006782 0.0009535
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## Postmu 0.39914 0.5732 0.6625 0.7518 0.9346
## Postsig 0.47696 0.5071 0.5248 0.5439 0.5829
## Postphi1 0.42384 0.5159 0.5634 0.6089 0.6979
## Postphi2 0.06034 0.1456 0.1920 0.2361 0.3286
The following Algorithm shows how to perform inference in ARMA(p,q) models using our GUI. See also Chapter 5 for details regarding the dataset structure.
Algorithm: Autoregressive Moving Average (ARMA) Models
Select Time series Model on the top panel
Select ARMA using the left radio button
Upload the dataset: first indicate whether the file has a header and which separator the csv file uses (comma, semicolon, or tab), then use the Browse button under the Choose File legend
Select MCMC iterations, burn-in, and thinning parameters using the Range sliders
Set the order of the ARMA model, p and q parameters
Set the frequency: annual (1), quarterly (4), monthly (12), etc.
Set the location and scale hyperparameters of the intercept, the autoregressive (AR) and moving average (MA) coefficients, and the standard deviation. Take into account that there is just one set of hyperparameters for the AR and MA coefficients. This step is not necessary, as our GUI uses non-informative priors by default
Click the Go! button
Analyze results
Download posterior chains and figures (density, autocorrelation and trace plots) using the Download Results button
The function stan_sarima uses the Stan software (Stan Development Team 2024), which in turn employs Hamiltonian Monte Carlo (HMC). The following code illustrates how to perform Bayesian inference in the AR(2) model by programming the HMC from scratch. It is important to note that this is only an illustration, as HMC is less efficient than the Gibbs sampler in this example. However, HMC can outperform traditional MCMC algorithms in more complex models, particularly when dealing with high-dimensional probability distributions or when MCMC struggles with poor mixing due to posterior correlation.
In the first block, we perform the simulation by setting \mu=0.5, \phi_1=0.5, \phi_2=0.3, \sigma=0.5, and a sample size of 1,000. We then set the hyperparameters and define the function to calculate the logarithm of the posterior distribution. The model is parametrized using \tau = \log(\sigma^2), such that \sigma^2=\exp(\tau), which avoids issues related to the non-negativity restriction of \sigma^2. As a result, we need to account for the Jacobian of this transformation, specifically d\sigma^2/d\tau = \exp(\tau).
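In particular, the change of variables works as follows: \begin{align*} p_{\tau}(\tau) = p_{\sigma^2}\left(e^{\tau}\right)\left|\frac{d\sigma^2}{d\tau}\right| = p_{\sigma^2}\left(e^{\tau}\right)e^{\tau} \propto \exp\left\{-(\alpha_0/2+1)\tau\right\}\exp\left\{-\delta_0/(2e^{\tau})\right\}e^{\tau}, \end{align*} which is the last factor appearing in the posterior below.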
Next, we define the function to compute the gradient vector of the log posterior distribution. It is preferable to calculate the gradient vector analytically, as using finite differences can be computationally expensive. However, it is good practice to check the analytical calculations by evaluating the function at the maximum posterior estimate, where it should return values close to 0, or by comparing the results with finite differences at a few evaluation points (a quick check along these lines is sketched after the gradient function below).
The posterior distribution is given by (note that we condition on the first two observations when forming the likelihood, which is not an issue given the large sample size): \begin{align*} \pi(\mu,\phi_1,\phi_2,\tau\mid \boldsymbol{y})&\propto \prod_{t=3}^T(\exp(\tau))^{-1/2}\exp\left\{-\frac{1}{2\exp(\tau)}(y_t-\mu-\phi_1y_{t-1}-\phi_2y_{t-2})^2\right\}\\ &\times\exp\left\{-\frac{1}{2\sigma^2_{\mu}}(\mu-\mu_0)^2\right\}\times\exp\left\{-\frac{1}{2\sigma^2_{\phi_1}}(\phi_1-\phi_{10})^2\right\}\\ &\times\exp\left\{-\frac{1}{2\sigma^2_{\phi_2}}(\phi_2-\phi_{20})^2\right\}\times\exp\left\{-(\alpha_0/2+1)\tau\right\}\exp\left\{-\delta_0/(2\exp(\tau))\right\}\exp(\tau). \end{align*}
The components of the gradient vector of the log posterior distribution are given by \begin{align*} \frac{\partial \log(\pi(\mu,\phi_1,\phi_2,\tau\mid \boldsymbol{y}))}{\partial\mu}&=\frac{\sum_{t=3}^T(y_t-\mu-\phi_1y_{t-1}-\phi_2y_{t-2})}{\exp(\tau)}-\frac{1}{\sigma_{\mu}^2}(\mu-\mu_0)\\ \frac{\partial\log(\pi(\mu,\phi_1,\phi_2,\tau\mid \boldsymbol{y}))}{\partial\phi_1}&=\frac{\sum_{t=3}^T(y_t-\mu-\phi_1y_{t-1}-\phi_2y_{t-2})y_{t-1}}{\exp(\tau)}-\frac{1}{\sigma_{\phi_1}^2}(\phi_1-\phi_{10})\\ \frac{\partial\log(\pi(\mu,\phi_1,\phi_2,\tau\mid \boldsymbol{y}))}{\partial\phi_2}&=\frac{\sum_{t=3}^T(y_t-\mu-\phi_1y_{t-1}-\phi_2y_{t-2})y_{t-2}}{\exp(\tau)}-\frac{1}{\sigma_{\phi_2}^2}(\phi_2-\phi_{20})\\ \frac{\partial\log(\pi(\mu,\phi_1,\phi_2,\tau\mid \boldsymbol{y}))}{\partial\tau}&=-\frac{(T-2)}{2}+\frac{\sum_{t=3}^T(y_t-\mu-\phi_1y_{t-1}-\phi_2y_{t-2})^2}{2\exp(\tau)}\\ &-(\alpha_0/2+1)+\delta_0/(2\exp(\tau))+1.\\ \end{align*}
Next, we provide the code for the Hamiltonian Monte Carlo, as outlined in Chapter 4. The initial values are set as follows: \mu=\bar{y} (the sample mean of y_t), \phi_1=\phi_2=0, and \tau=\exp\left(s_y^2\right), where s_y^2 is the sample variance of y_t, with \boldsymbol{M} being the inverse covariance matrix of the posterior distribution evaluated at its maximum. Additionally, \epsilon is randomly drawn from a uniform distribution between 0 and 2\epsilon_0, with \epsilon_0=0.1, and L is set to \lceil 1/\epsilon \rceil, so that L\epsilon\approx 1.
We can verify that all 95% credible intervals encompass the population values, and the posterior means are close to them. The acceptance rate averages above 65% (approximately 0.92), so we could consider increasing the base step size (\epsilon_0). Furthermore, we do not impose the stationarity conditions on \phi_1 and \phi_2; Exercise 6 asks you to program an HMC algorithm that takes these requirements into account.
# Simulation AR(2)
rm(list = ls()); set.seed(010101); T <- 1000; K <- 4
mu <- 0.5; phi1 <- 0.5; phi2 <- 0.3; sig <- 0.5
Ey <- mu/(1-phi1-phi2); Sigy <- sig*((1-phi2)/(1-phi2-phi1^2-phi2*phi1^2-phi2^2+phi2^3))^0.5
y <- rnorm(T, mean = Ey, sd = Sigy); e <- rnorm(T, mean = 0, sd = sig)
for(t in 3:T){
y[t] <- mu + phi1*y[t-1] + phi2*y[t-2] + e[t]
}
# Hyperparameters
d0 <- 0.01; a0 <- 0.01; mu0 <- 0; MU0 <- 1
phi0 <- c(0, 0); Phi0 <- diag(2)
# Log posterior multiplied by -1 so that it can be minimized with optim
LogPost <- function(theta, y){
mu <- theta[1]; phi1 <- theta[2]; phi2 <- theta[3]
tau <- theta[4]; sig2 <- exp(tau); logLik <- NULL
for(t in 3:T){
logLikt <- dnorm(y[t], mean = mu + phi1*y[t-1] + phi2*y[t-2], sd = sig2^0.5, log = TRUE)
logLik <- c(logLik, logLikt)
}
logLik <- sum(logLik)
logPrior <- dnorm(mu, mean = mu0, sd = MU0^0.5, log = TRUE) + dnorm(phi1, mean = phi0[1], sd = Phi0[1,1]^0.5, log = TRUE) + dnorm(phi2, mean = phi0[2], sd = Phi0[2,2]^0.5, log = TRUE) + invgamma::dinvgamma(sig2, shape = a0/2, rate = d0/2, log = TRUE)
logPosterior <- logLik + logPrior + tau
return(-logPosterior) # Multiply by -1 to minimize using optim
}
theta0 <- c(mean(y), 0, 0, var(y))
Opt <- optim(theta0, LogPost, y = y, hessian = TRUE)
theta0 <- Opt$par; VarPost <- solve(Opt$hessian)
# Gradient log posterior
GradientTheta <- function(theta, y){
mu <- theta[1]; phi1 <- theta[2]; phi2 <- theta[3]
tau <- theta[4]; sig2 <- exp(tau); SumLik <- matrix(0, 3, 1)
SumLik2 <- NULL
for(t in 3:T){
xt <- matrix(c(1, y[t-1], y[t-2]), 3, 1)
SumLikt <- (y[t] - (mu + phi1*y[t-1] + phi2*y[t-2]))*xt
SumLik2t <- (y[t] - (mu + phi1*y[t-1] + phi2*y[t-2]))^2
SumLik <- rowSums(cbind(SumLik, SumLikt))
SumLik2 <- sum(SumLik2, SumLik2t)
}
Grad_mu <- SumLik[1]/sig2 - (1/MU0)*(mu - mu0)
Grad_phi1 <- SumLik[2]/exp(tau) - 1/Phi0[1,1]*(phi1 - phi0[1])
Grad_phi2 <- SumLik[3]/exp(tau) - 1/Phi0[2,2]*(phi2 - phi0[2])
Grad_tau <- -(T-2)/2 + SumLik2/(2*exp(tau)) - (a0/2 + 1) + d0/(2*exp(tau)) + 1
Grad <- c(Grad_mu, Grad_phi1, Grad_phi2, Grad_tau)
return(Grad)
}
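The following sketch illustrates the gradient check mentioned above; it assumes the numDeriv package is installed (it is not used elsewhere in this section) and evaluates both gradients at the posterior mode obtained by optim.
# Check the analytical gradient against a numerical one at the posterior mode
# (LogPost returns minus the log posterior, hence the sign change)
NumGrad <- -numDeriv::grad(LogPost, Opt$par, y = y)
cbind(Analytical = GradientTheta(Opt$par, y), Numerical = NumGrad)
# Both columns should agree and be close to zero at the mode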
# Hamiltonian Monte Carlo function
HMC <- function(theta, y, epsilon, M){
L <- ceiling(1/epsilon)
Minv <- solve(M); thetat <- theta
K <- length(thetat)
# Draw the auxiliary momentum and evaluate the current log posterior plus log momentum density
mom <- t(mvtnorm::rmvnorm(1, rep(0, K), M))
logPost_Mom_t <- -LogPost(thetat, y) + mvtnorm::dmvnorm(t(mom), rep(0, K), M, log = TRUE)
# Leapfrog-style integration: half momentum steps at the first and last iterations, full steps in between
for(l in 1:L){
if(l == 1 | l == L){
mom <- mom + 0.5*epsilon*GradientTheta(theta, y)
theta <- theta + epsilon*Minv%*%mom
}else{
mom <- mom + epsilon*GradientTheta(theta, y)
theta <- theta + epsilon*Minv%*%mom
}
}
# Metropolis-Hastings acceptance step
logPost_Mom_star <- -LogPost(theta, y) + mvtnorm::dmvnorm(t(mom), rep(0, K), M, log = TRUE)
alpha <- min(1, exp(logPost_Mom_star-logPost_Mom_t))
u <- runif(1)
if(u <= alpha){
thetaNew <- c(theta)
}else{
thetaNew <- thetat
}
rest <- list(theta = thetaNew, Prob = alpha)
return(rest)
}
# Posterior draws
S <- 1000; burnin <- 1000; thin <- 2; tot <- S + burnin
thetaPost <- matrix(NA, tot, K)
ProbAccept <- rep(NA, tot)
theta0 <- c(mean(y), 0, 0, exp(var(y)))
M <- solve(VarPost); epsilon0 <- 0.1
pb <- winProgressBar(title = "progress bar", min = 0, max = tot, width = 300)
for(s in 1:tot){
epsilon <- runif(1, 0, 2*epsilon0)
L <- ceiling(1/epsilon)
HMCs <- HMC(theta = theta0, y, epsilon, M)
theta0 <- HMCs$theta
thetaPost[s,] <- HMCs$theta
ProbAccept[s] <- HMCs$Prob
setWinProgressBar(pb, s, title=paste( round(s/tot*100, 0), "% done"))
}
close(pb)
## NULL
keep <- seq(burnin+1, tot, thin)
summary(coda::mcmc(thetaPost[keep,])) # posterior summary of (mu, phi1, phi2, tau)
##
## Iterations = 1:500
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 500
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## [1,] 0.4830 0.05773 0.002582 0.003488
## [2,] 0.5167 0.02843 0.001272 0.001698
## [3,] 0.2927 0.02918 0.001305 0.001481
## [4,] -1.3317 0.04577 0.002047 0.002756
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## var1 0.3607 0.4470 0.4864 0.5212 0.5902
## var2 0.4536 0.5009 0.5160 0.5334 0.5721
## var3 0.2363 0.2728 0.2923 0.3139 0.3484
## var4 -1.4202 -1.3604 -1.3331 -1.3020 -1.2431
summary(coda::mcmc(exp(thetaPost[keep,4]))) # posterior summary of sigma^2 = exp(tau)
##
## Iterations = 1:500
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 500
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 0.2642942 0.0121270 0.0005423 0.0007275
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## 0.2417 0.2566 0.2637 0.2720 0.2885
summary(coda::mcmc(ProbAccept[keep])) # summary of the acceptance probabilities
##
## Iterations = 1:500
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 500
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 0.916318 0.129812 0.005805 0.005805
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## 0.5470 0.8746 0.9841 1.0000 1.0000
References