10 Modeling Daily Returns with the GARCH Model
Updated: May 12, 2021
Copyright © Eric Zivot 2016
Outline
- Motivation
- Review stylized facts of daily return data and contrast with monthly returns
- Modeling daily returns is most useful for short term risk analysis of assets and portfolios using volatility and VaR
- CER model captures most stylized facts of monthly returns but not daily returns
- Need a model for time varying volatility
- Engle’s ARCH model captures stylized facts of daily returns; Engle was awarded the Nobel Prize in Economics partly for this model.
- Engle’s ARCH Model
- Start with ARCH(1) process
- State assumptions and derive properties
- Show simulated values and do reality check against actual data
- Introduce rugarch R package
- General ARCH(p) process
- Bollerslev’s GARCH(1,1) process
- No need to go beyond GARCH(1,1)
- Give assumptions and derive basic properties
- Show simulated values and do reality check against actual data
- GARCH(1,1) with Student t errors
- Show simulated values and do reality check against actual data
- Maximum Likelihood Estimation
- Overview of the technique of MLE
- Illustration with CER model and GARCH(1,1)
- Numerical optimization
- Asymptotic properties
- Calculating standard errors
- Maximum likelihood estimation of ARCH and GARCH models
- GARCH(1,1) log-likelihood function
- Forecasting Conditional Volatility from GARCH(1,1)
- Forecasting algorithm
- Multiperiod
- Conditional VaR
- Unconditional vs. conditional VaR
- 1-day ahead VaR forecast
- h-day ahead VaR forecast
In Chapter 5, it was shown that daily asset returns share some features with monthly asset returns but differ in important ways. In particular, daily returns have empirical distributions with much fatter tails than the normal distribution, and daily returns are not independent over time. Absolute and squared daily returns are positively autocorrelated, and the autocorrelation dies out very slowly. Daily volatility appears to be autocorrelated and, hence, predictable. In this chapter we present a model of daily asset returns, Robert Engle’s autoregressive conditional heteroskedasticity (ARCH) model, that can capture these stylized facts specific to daily returns. The ARCH model is one of the most important models in the field of financial econometrics, and its creator Robert Engle won the Nobel Prize in Economics in part for his work on the ARCH model and its variants.
The R packages used in this chapter are IntroCompFinR, rugarch and xts. Make sure these packages are installed and loaded before running the R examples in this chapter.
- Why do we care about time varying volatility? Volatility is a standard measure of asset risk and if volatility varies over time then asset risk also varies over time. In periods of low volatility, asset risk is low and in periods of high volatility asset risk is high. Time varying volatility also impacts the risk measure value-at-risk (VaR), which gives the dollar loss that could occur over a period of time with a given probability. Volatility is an important driver for the price of call and put options as well as other derivative securities. That volatility is predictable has important implications for accurate pricing models for options and derivatives.
- Why do we care about the predictability of volatility? We can forecast future values of volatility. This allows us to forecast VaR on an asset or portfolio.
10.1 Engle’s ARCH Model
- Goal: create a simple time series model that captures the basic stylized facts of daily return data
- Foundation of the field of financial econometrics
10.1.1 The ARCH(1) Model
Let R_{t} denote the continuously compounded daily return on an asset. The CER model for R_{t} can be expressed in the form:
\begin{align}
R_{t}&=\mu+\epsilon_{t},\,t=1,\ldots,T,\tag{10.1}\\
\epsilon_{t}&=\sigma z_{t},\tag{10.2}\\
z_{t}&\sim iid\ N(0,1).\tag{10.3}
\end{align}
Here, the unexpected return \epsilon_{t} is expressed as \sigma z_{t} where \sigma is the unconditional volatility of \epsilon_{t}, which is assumed to be constant, and z_{t}=\epsilon_{t}/\sigma is the standardized unexpected return. In (10.1)-(10.3), R_{t}\sim iid\ N(\mu,\sigma^{2}), which is not a good model for daily returns given the stylized facts. The ARCH(1) model for R_{t} has a similar form:
\begin{align}
R_{t}&=\mu+\epsilon_{t},\,t=1,\ldots,T,\tag{10.4}\\
\epsilon_{t}&=\sigma_{t}z_{t},\tag{10.5}\\
\sigma_{t}^{2}&=\omega+\alpha_{1}\epsilon_{t-1}^{2},\tag{10.6}\\
z_{t}&\sim iid\ N(0,1),\tag{10.7}\\
\omega&>0,\,0\leq\alpha_{1}<1.\tag{10.8}
\end{align}
In the ARCH(1) model, the unexpected return \epsilon_{t} is expressed as \sigma_{t}z_{t} where \sigma_{t} is the conditional (time t dependent) volatility of \epsilon_{t}. Hence, the ARCH(1) model extends the CER model to allow for time varying conditional volatility \sigma_{t}.
In the ARCH(1) model (10.4) - (10.8), the restrictions \omega>0 and 0\leq\alpha_{1}<1 are required to ensure that \sigma_{t}^{2}>0 and that \{R_{t}\} is a covariance stationary time series. Equation (10.6) shows that the return variance at time t, \sigma_{t}^{2}, is a positive linear function of the squared unexpected return at time t-1, \epsilon_{t-1}^{2}. This allows the magnitude of yesterday’s return to influence today’s return variance and volatility. In particular, a large (small) value of \epsilon_{t-1}^{2} leads to a large (small) value of \sigma_{t}^{2}. This feedback from \epsilon_{t-1}^{2} to \sigma_{t}^{2} can explain some of the volatility clustering observed in daily returns.
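The feedback from \epsilon_{t-1}^{2} to \sigma_{t}^{2} can be seen in a direct simulation of the ARCH(1) recursion. The following base-R sketch (the parameter values \omega=0.5, \alpha_{1}=0.5 are the illustrative choices used later in this section) generates returns by iterating the variance equation by hand:

```r
# Direct simulation of an ARCH(1): sigma2[t] = omega + alpha1 * eps[t-1]^2
set.seed(123)
n <- 1000
mu <- 0; omega <- 0.5; alpha1 <- 0.5      # illustrative parameter values
eps <- numeric(n); sigma2 <- numeric(n)
sigma2[1] <- omega / (1 - alpha1)          # start at the unconditional variance
eps[1] <- sqrt(sigma2[1]) * rnorm(1)
for (t in 2:n) {
  sigma2[t] <- omega + alpha1 * eps[t-1]^2 # yesterday's shock drives today's variance
  eps[t] <- sqrt(sigma2[t]) * rnorm(1)
}
R <- mu + eps
var(R)                                     # should be near omega/(1 - alpha1) = 1
```

With \alpha_{1}=0.5 the unconditional variance is \omega/(1-\alpha_{1})=1, so the sample variance of the simulated returns should be near that value.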
10.1.1.1 Statistical properties of the ARCH(1) Model
Let I_{t}=\{R_{t},R_{t-1},\ldots,R_{1}\} denote the information set at time t (conditioning set of random variables) as described in Chapter 4. As we shall show below, the ARCH(1) model (10.4) - (10.8) has the following statistical properties:
- E(\epsilon_{t}|I_{t-1})=0, E(\epsilon_{t})=0.
- var(R_{t}|I_{t-1})=E(\epsilon_{t}^{2}|I_{t-1})=\sigma_{t}^{2}.
- var(R_{t})=E(\epsilon_{t}^{2})=E(\sigma_{t}^{2})=\omega/(1-\alpha_{1}).
- \{R_{t}\} is an uncorrelated process: cov(R_{t},R_{t-k})=E(\epsilon_{t}\epsilon_{t-k})=0 for k>0.
- The distribution of R_{t} conditional on I_{t-1} is normal with mean \mu and variance \sigma_{t}^{2}.
- The unconditional (marginal) distribution of R_{t} is not normal and kurt(R_{t})\geq3.
- \{R_{t}^{2}\} and \{\epsilon_{t}^{2}\} have a covariance stationary AR(1) model representation. The persistence of the autocorrelations is measured by \alpha_{1}.
These properties of the ARCH(1) model match many of the stylized facts of daily asset returns.
It is instructive to derive each of the above properties as the derivations make use of certain results and tricks for manipulating conditional expectations. First, consider the derivation for property 1. Because \sigma_{t}^{2}=\omega+\alpha_{1}\epsilon_{t-1}^{2} depends only on information dated t-1, \sigma_{t}^{2}\in I_{t-1} and so \sigma_{t}^{2} may be treated as a constant when conditioning on I_{t-1}. Hence, E(\epsilon_{t}|I_{t-1})=E(\sigma_{t}z_{t}|I_{t-1})=\sigma_{t}E(z_{t}|I_{t-1})=0. The last equality follows from the fact that z_{t}\sim iid\ N(0,1), which implies that z_{t} is independent of I_{t-1} and so E(z_{t}|I_{t-1})=E(z_{t})=0. By iterated expectations, it follows that E(\epsilon_{t})=E(E(\epsilon_{t}|I_{t-1}))=E(0)=0. The derivation of the second property uses similar computations: var(R_{t}|I_{t-1})=E((R_{t}-\mu)^{2}|I_{t-1})=E(\epsilon_{t}^{2}|I_{t-1})=E(\sigma_{t}^{2}z_{t}^{2}|I_{t-1})=\sigma_{t}^{2}E(z_{t}^{2}|I_{t-1})=\sigma_{t}^{2}, where the last equality uses the result E(z_{t}^{2}|I_{t-1})=E(z_{t}^{2})=1.
To derive property 3, first note that by iterated expectations and property 2, var(R_{t})=E(\epsilon_{t}^{2})=E(E(\epsilon_{t}^{2}|I_{t-1}))=E(\sigma_{t}^{2}).
Next, using (10.6) we have that E(\sigma_{t}^{2})=\omega+\alpha_{1}E(\epsilon_{t-1}^{2}). Now use the fact that \{R_{t}\}, and hence \{\epsilon_{t}\}, is covariance stationary, which implies that E(\epsilon_{t}^{2})=E(\epsilon_{t-1}^{2})=E(\sigma_{t}^{2}) so that E(\sigma_{t}^{2})=\omega+\alpha_{1}E(\sigma_{t}^{2}). Solving for E(\sigma_{t}^{2}) then gives var(R_{t})=E(\sigma_{t}^{2})=\frac{\omega}{1-\alpha_{1}}.
For the fourth property, first note that for k>0, cov(R_{t},R_{t-k})=E((R_{t}-\mu)(R_{t-k}-\mu))=E(\epsilon_{t}\epsilon_{t-k})=E(\sigma_{t}z_{t}\sigma_{t-k}z_{t-k}). Next, by iterated expectations and the fact that z_{t}\sim iid\ N(0,1) we have E(\sigma_{t}z_{t}\sigma_{t-k}z_{t-k})=E(E(\sigma_{t}z_{t}\sigma_{t-k}z_{t-k}|I_{t-1}))=E(\sigma_{t}\sigma_{t-k}z_{t-k}E(z_{t}|I_{t-1}))=0, and so cov(R_{t},R_{t-k})=0.
To derive the fifth property, write R_{t}=\mu+\epsilon_{t}=\mu+\sigma_{t}z_{t}. Then, conditional on I_{t-1} and using the fact that z_{t}\sim iid\ N(0,1), we have R_{t}|I_{t-1}\sim N(\mu,\sigma_{t}^{2}).
The sixth property has two parts. To see that the unconditional distribution of R_{t} is not a normal distribution, it is useful to express R_{t} in terms of current and past values of z_{t}. Start with R_{t}=\mu+\sigma_{t}z_{t}. From (10.6), \sigma_{t}^{2}=\omega+\alpha_{1}\epsilon_{t-1}^{2}=\omega+\alpha_{1}\sigma_{t-1}^{2}z_{t-1}^{2}, which implies that \sigma_{t}=(\omega+\alpha_{1}\sigma_{t-1}^{2}z_{t-1}^{2})^{1/2}. Then R_{t}=\mu+(\omega+\alpha_{1}\sigma_{t-1}^{2}z_{t-1}^{2})^{1/2}z_{t}.
Even though z_{t}\sim iid\ N(0,1), R_{t} is a complicated nonlinear function of z_{t} and past values of z_{t}^{2} and so R_{t} cannot be a normally distributed random variable. Next, to see that kurt(R_{t})\geq3, first note that
\begin{equation} kurt(R_{t})=\frac{E((R_{t}-\mu)^{4})}{var(R_{t})^{2}}=\frac{E(\epsilon_{t}^{4})}{\omega^{2}/(1-\alpha_{1})^{2}}.\tag{10.9} \end{equation}
Now by iterated expectations and the fact that E(z_{t}^{4})=3, E(\epsilon_{t}^{4})=E(E(\sigma_{t}^{4}z_{t}^{4}|I_{t-1}))=E(\sigma_{t}^{4}E(z_{t}^{4}|I_{t-1}))=3E(\sigma_{t}^{4}). Next, using \sigma_{t}^{4}=(\sigma_{t}^{2})^{2}=(\omega+\alpha_{1}\epsilon_{t-1}^{2})^{2}=\omega^{2}+2\omega\alpha_{1}\epsilon_{t-1}^{2}+\alpha_{1}^{2}\epsilon_{t-1}^{4}, we have E(\epsilon_{t}^{4})=3E(\omega^{2}+2\omega\alpha_{1}\epsilon_{t-1}^{2}+\alpha_{1}^{2}\epsilon_{t-1}^{4})=3\omega^{2}+6\omega\alpha_{1}E(\epsilon_{t-1}^{2})+3\alpha_{1}^{2}E(\epsilon_{t-1}^{4}). Since \{R_{t}\} is covariance stationary it follows that E(\epsilon_{t-1}^{2})=E(\epsilon_{t}^{2})=E(\sigma_{t}^{2})=\omega/(1-\alpha_{1}) and E(\epsilon_{t-1}^{4})=E(\epsilon_{t}^{4}). Hence, E(\epsilon_{t}^{4})=3\omega^{2}+6\omega\alpha_{1}\left(\frac{\omega}{1-\alpha_{1}}\right)+3\alpha_{1}^{2}E(\epsilon_{t}^{4})=3\omega^{2}\left(1+\frac{2\alpha_{1}}{1-\alpha_{1}}\right)+3\alpha_{1}^{2}E(\epsilon_{t}^{4}). Solving for E(\epsilon_{t}^{4}) gives E(\epsilon_{t}^{4})=\frac{3\omega^{2}(1+\alpha_{1})}{(1-\alpha_{1})(1-3\alpha_{1}^{2})}. Because 0\leq\alpha_{1}<1, in order for E(\epsilon_{t}^{4})<\infty it must be the case that 1-3\alpha_{1}^{2}>0, which implies that 0\leq\alpha_{1}^{2}<\frac{1}{3} or 0\leq\alpha_{1}<0.577. Substituting the above expression for E(\epsilon_{t}^{4}) into (10.9) then gives kurt(R_{t})=\frac{3\omega^{2}(1+\alpha_{1})}{(1-\alpha_{1})(1-3\alpha_{1}^{2})}\times\frac{(1-\alpha_{1})^{2}}{\omega^{2}}=\frac{3(1-\alpha_{1}^{2})}{1-3\alpha_{1}^{2}}\geq3. Hence, if returns follow an ARCH(1) process with \alpha_{1}>0 then kurt(R_{t})>3, which implies that the unconditional distribution of returns has fatter tails than the normal distribution.
To show the last property, add \epsilon_{t}^{2} to both sides of (10.6) to give \epsilon_{t}^{2}+\sigma_{t}^{2}=\omega+\alpha_{1}\epsilon_{t-1}^{2}+\epsilon_{t}^{2}\Rightarrow\epsilon_{t}^{2}=\omega+\alpha_{1}\epsilon_{t-1}^{2}+\epsilon_{t}^{2}-\sigma_{t}^{2}=\omega+\alpha_{1}\epsilon_{t-1}^{2}+v_{t}, where v_{t}=\epsilon_{t}^{2}-\sigma_{t}^{2} is a mean-zero and uncorrelated error term. Hence, \epsilon_{t}^{2} and R_{t}^{2} follow an AR(1) process with positive autoregressive coefficient \alpha_{1}. The autocorrelations of R_{t}^{2} are cor(R_{t}^{2},R_{t-j}^{2})=\alpha_{1}^{j} for j\geq1, which are positive and decay toward zero exponentially fast as j gets large. Here, the persistence of the autocorrelations is measured by \alpha_{1}.
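This AR(1) structure can be checked by simulation: the sample autocorrelations of the squared simulated returns should be close to \alpha_{1}^{j}. A base-R sketch (the value \alpha_{1}=0.3 is an illustrative choice, small enough that the higher moments driving the sample ACF of squares are well behaved):

```r
# The AR(1) representation implies cor(R_t^2, R_{t-j}^2) = alpha1^j
set.seed(123)
n <- 10000
omega <- 0.7; alpha1 <- 0.3    # illustrative; unconditional variance = 0.7/0.7 = 1
eps <- numeric(n)
s2 <- omega / (1 - alpha1)     # start at the unconditional variance
for (t in 1:n) {
  eps[t] <- sqrt(s2) * rnorm(1)
  s2 <- omega + alpha1 * eps[t]^2
}
sample.acf <- acf(eps^2, lag.max = 3, plot = FALSE)$acf[2:4]
theory.acf <- alpha1^(1:3)     # 0.3, 0.09, 0.027
round(cbind(sample.acf, theory.acf), 3)
```

The first column (sample ACF of the squared returns at lags 1-3) should be close to the second column (the theoretical values \alpha_{1}^{j}).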
To be completed
◼
10.1.1.2 The rugarch package
- Give a brief introduction to the rugarch package
Consider the simple ARCH(1) model for returns R_{t}=\epsilon_{t}=\sigma_{t}z_{t}, z_{t}\sim iid\ N(0,1), \sigma_{t}^{2}=0.5+0.5R_{t-1}^{2}. Here, \alpha_{1}=0.5<1 so that R_{t} is covariance stationary and \bar{\sigma}^{2}=\omega/(1-\alpha_{1})=0.5/(1-0.5)=1. Using (10.9), kurt(R_{t})=\frac{3(1-\alpha_{1}^{2})}{1-3\alpha_{1}^{2}}=\frac{3(1-(0.5)^{2})}{1-3(0.5)^{2}}=9, so that the distribution of R_{t} has much fatter tails than the normal distribution.
The rugarch functions ugarchspec() and ugarchpath() can be used to simulate T=1000 values of R_{t} and \sigma_{t} from this model.53 The ARCH(1) model is specified using the ugarchspec() function as follows:
library(rugarch)
library(PerformanceAnalytics)
arch1.spec = ugarchspec(variance.model = list(garchOrder=c(1,0)),
                        mean.model = list(armaOrder=c(0,0)),
                        fixed.pars = list(mu = 0, omega=0.5, alpha1=0.5))
The argument fixed.pars is a list whose components give the ARCH(1) model parameters. The names of the list components match the parameters from the ARCH(1) model: mu is \mu, omega is \omega, and alpha1 is \alpha_{1}. Simulated values of R_{t} and \sigma_{t} are produced using the ugarchpath() function taking as input the "uGARCHspec" object arch1.spec and the number of simulations, n.sim=1000, to produce:
set.seed(123)
arch1.sim = ugarchpath(arch1.spec, n.sim=1000)
class(arch1.sim)
## [1] "uGARCHpath"
## attr(,"package")
## [1] "rugarch"
slotNames(arch1.sim)
## [1] "path" "model" "seed"
names(arch1.sim@path)
## [1] "sigmaSim" "seriesSim" "residSim"
The object arch1.sim is an S4 object of class "uGARCHpath" for which there are sigma(), fitted(), quantile(), show() and plot() methods.54 The path slot is a list which contains the simulated values of \sigma_{t}, R_{t} and \epsilon_{t}=R_{t}-\mu_{t} as matrix objects. The method functions sigma() and fitted() extract \sigma_{t} and \mu_{t}, respectively. Invoking the plot() method produces a menu of plot choices. Individual plots can be produced directly by specifying the plot number in the call to plot(). For example, Figure 10.1 shows the plots of the simulated values for R_{t} and \sigma_{t}, created with:
par(mfrow=c(2,1))
plot(arch1.sim, which=2)
plot(arch1.sim, which=1)
par(mfrow=c(1,1))

Figure 10.1: Simulated values from ARCH(1) process. Top panel: simulated values of R_{t}. Bottom panel: simulated values of \sigma_{t}.
Figure 10.2 shows the sample autocorrelations of R_{t} and R_{t}^{2}. As expected, returns are uncorrelated whereas R_{t}^{2} has autocorrelations described by an AR(1) process with positive autoregressive coefficient.

Figure 10.2: SACFs for R_{t} (top panel) and R_{t}^{2} (bottom panel) from simulated ARCH(1) model.
The sample variance and excess kurtosis of the simulated returns are
as.numeric(var(arch1.sim@path$seriesSim))
## [1] 0.876
kurtosis(arch1.sim@path$seriesSim)
## [1] 0.65
The sample variance is very close to the unconditional variance \bar{\sigma}^{2}=1. The sample excess kurtosis, however, is well below the ARCH(1) excess kurtosis of 6; sample fourth moments converge very slowly for this model, so large discrepancies in T=1000 observations are not unusual.
◼
10.1.2 The ARCH(p) Model
The ARCH(p) model extends the autocorrelation structure of R_{t}^{2} and \epsilon_{t}^{2} in the ARCH(1) model (10.4) - (10.6) to that of an AR(p) process by adding p lags of \epsilon_{t}^{2} to the dynamic equation for \sigma_{t}^{2}:
\begin{equation} \sigma_{t}^{2}=\omega+\alpha_{1}\epsilon_{t-1}^{2}+\alpha_{2}\epsilon_{t-2}^{2}+\cdots+\alpha_{p}\epsilon_{t-p}^{2}.\tag{10.10} \end{equation}
In order for \sigma_{t}^{2} to be positive we need to impose the restrictions
\begin{equation} \omega>0,\,\alpha_{1}\geq0,\,\alpha_{2}\geq0,\ldots,\alpha_{p}\geq0.\tag{10.11} \end{equation}
In addition, for \{R_{t}\} to be a covariance stationary time series we must have the restriction
\begin{equation} 0\leq\alpha_{1}+\alpha_{2}+\cdots+\alpha_{p}<1.\tag{10.12} \end{equation}
The statistical properties of the ARCH(p) model are the same as those for the ARCH(1) model with the following exceptions. The unconditional variance of \{R_{t}\} is
\begin{equation} var(R_{t})=E(\epsilon_{t}^{2})=E(\sigma_{t}^{2})=\omega/(1-\alpha_{1}-\alpha_{2}-\cdots-\alpha_{p}),\tag{10.13} \end{equation}
and \{R_{t}^{2}\} and \{\epsilon_{t}^{2}\} have a covariance stationary AR(p) model representation whose autocorrelation persistence is measured by the sum of the ARCH coefficients \alpha_{1}+\alpha_{2}+\cdots+\alpha_{p}.
The ARCH(p) model is capable of creating a much richer autocorrelation structure for R_{t}^{2}. In the ARCH(1) model, the autocorrelations of R_{t}^{2} decay to zero fairly quickly, whereas the sample autocorrelations of squared daily returns in actual data decay to zero very slowly. In the ARCH(p) model, because of the restrictions (10.11) and (10.12), for large values of p the dynamics of \sigma_{t}^{2} from (10.10) can more closely mimic the observed autocorrelations of actual daily returns.
To be completed. Simulate a long-order ARCH model (e.g., ARCH(5)) to show the difference with ARCH(1). Make the sum of the ARCH coefficients equal to 0.9.
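One way the planned example might be filled in is the following sketch: an ARCH(5) whose coefficients sum to 0.9 (the individual coefficient values and \omega=0.1 are illustrative assumptions, not from the text), giving unconditional variance \omega/(1-\sum_{i}\alpha_{i})=0.1/0.1=1 but much higher persistence than the ARCH(1) example:

```r
library(rugarch)
# ARCH(5) with coefficients summing to 0.9 (illustrative values):
# unconditional variance = omega / (1 - sum(alpha_i)) = 0.1/0.1 = 1
arch5.spec = ugarchspec(variance.model = list(garchOrder = c(5,0)),
                        mean.model = list(armaOrder = c(0,0)),
                        fixed.pars = list(mu = 0, omega = 0.1,
                                          alpha1 = 0.3, alpha2 = 0.2, alpha3 = 0.2,
                                          alpha4 = 0.1, alpha5 = 0.1))
set.seed(123)
arch5.sim = ugarchpath(arch5.spec, n.sim = 1000)
as.numeric(var(arch5.sim@path$seriesSim))   # should be near 1
```

The SACF of the squared simulated series should die out much more slowly than in Figure 10.2, mimicking actual daily returns more closely.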
◼
10.2 Bollerslev’s GARCH Model
An important extension of the ARCH(p) model proposed by Bollerslev (1986) replaces the AR(p) representation of \sigma_{t}^{2} in (10.10) with an ARMA(p,q) formulation, giving the GARCH(p,q) model:
\begin{equation} \sigma_{t}^{2}=\omega+\sum_{i=1}^{p}\alpha_{i}\epsilon_{t-i}^{2}+\sum_{j=1}^{q}\beta_{j}\sigma_{t-j}^{2}.\tag{10.14} \end{equation}
In order for \sigma_{t}^{2}>0, it is assumed that \omega>0 and that the coefficients \alpha_{i} (i=1,\ldots,p) and \beta_{j} (j=1,\ldots,q) are all non-negative.55 When q=0, the GARCH(p,q) model (10.14) reduces to the ARCH(p) model. In general, the GARCH(p,q) model can be shown to be equivalent to a particular ARCH(\infty) model.
Usually the GARCH(1,1) model,
\begin{equation} \sigma_{t}^{2}=\omega+\alpha_{1}\epsilon_{t-1}^{2}+\beta_{1}\sigma_{t-1}^{2},\tag{10.15} \end{equation}
with only three parameters in the conditional variance equation, is adequate to obtain a good model fit for daily asset returns. Indeed, Hansen and Lunde (2005) provided compelling evidence that it is difficult to find a volatility model that outperforms the simple GARCH(1,1). Hence, for many purposes the GARCH(1,1) model is the de facto volatility model of choice for daily returns.
10.2.1 Statistical Properties of the GARCH(1,1) Model
The statistical properties of the GARCH(1,1) model are derived in the same way as the properties of the ARCH(1) model and are summarized below:
- \{R_{t}\} is a covariance stationary and ergodic process provided \alpha_{1}+\beta_{1}<1.
- var(R_{t})=E(\epsilon_{t}^{2})=E(\sigma_{t}^{2})=\omega/(1-\alpha_{1}-\beta_{1})=\bar{\sigma}^{2}.
- The distribution of R_{t} conditional on I_{t-1} is normal with mean \mu and variance \sigma_{t}^{2}.
- The unconditional (marginal) distribution of R_{t} is not normal and kurt(R_{t})=\frac{3(1+\alpha_{1}+\beta_{1})(1-\alpha_{1}-\beta_{1})}{1-2\alpha_{1}\beta_{1}-3\alpha_{1}^{2}-\beta_{1}^{2}}\geq3.
- \{R_{t}^{2}\} and \{\epsilon_{t}^{2}\} have an ARMA(1,1) representation \epsilon_{t}^{2}=\omega+(\alpha_{1}+\beta_{1})\epsilon_{t-1}^{2}+u_{t}-\beta_{1}u_{t-1}, with u_{t}=\epsilon_{t}^{2}-\sigma_{t}^{2}. The persistence of the autocorrelations is given by \alpha_{1}+\beta_{1}.
- \sigma_{t}^{2} has an AR(\infty) representation \sigma_{t}^{2}=\frac{\omega}{1-\beta_{1}}+\alpha_{1}\sum_{j=0}^{\infty}\beta_{1}^{j}\epsilon_{t-j-1}^{2}.
The derivations of these properties are left as end-of-chapter exercises.
- Need some comments here: more flexible autocorrelations structure for R2t than ARCH(p) and with fewer parameters.
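A small helper makes properties 2 and 4 concrete: given (\omega,\alpha_{1},\beta_{1}), it returns the unconditional variance and kurtosis implied by the formulas above (the parameter values in the call below are illustrative):

```r
# Unconditional variance and kurtosis implied by a GARCH(1,1)
garch11.moments <- function(omega, alpha1, beta1) {
  stopifnot(omega > 0, alpha1 >= 0, beta1 >= 0, alpha1 + beta1 < 1)
  uncond.var <- omega / (1 - alpha1 - beta1)
  denom <- 1 - 2*alpha1*beta1 - 3*alpha1^2 - beta1^2   # must be > 0 for finite kurtosis
  kurt <- if (denom > 0) 3 * (1 + alpha1 + beta1) * (1 - alpha1 - beta1) / denom else Inf
  c(uncond.var = uncond.var, kurtosis = kurt)
}
garch11.moments(omega = 0.1, alpha1 = 0.1, beta1 = 0.8)  # variance 1, kurtosis about 3.35
```

Note that persistence (\alpha_{1}+\beta_{1}) and fat tails interact: pushing \alpha_{1}+\beta_{1} toward 1 drives the denominator toward zero and the implied kurtosis toward infinity.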
Consider simulating T=1000 observations from the GARCH(1,1) model
R_{t}=\epsilon_{t}=\sigma_{t}z_{t},\ z_{t}\sim iid\ N(0,1),\ \sigma_{t}^{2}=0.1+0.1R_{t-1}^{2}+0.8\sigma_{t-1}^{2}.
This process is covariance stationary since \alpha_{1}+\beta_{1}=0.1+0.8=0.9<1, and \bar{\sigma}^{2}=\omega/(1-\alpha_{1}-\beta_{1})=0.1/0.1=1. This GARCH(1,1) model has the same unconditional variance as the ARCH(1) model from the earlier example but has much higher persistence. The model can be specified using the rugarch ugarchspec() function as follows:
garch11.spec = ugarchspec(variance.model = list(garchOrder=c(1,1)),
                          mean.model = list(armaOrder=c(0,0)),
                          fixed.pars = list(mu = 0, omega=0.1,
                                            alpha1=0.1, beta1 = 0.8))
Simulated values for R_{t} and \sigma_{t}, using the same random number seed as the ARCH(1) simulations, are created using:
set.seed(123)
garch11.sim = ugarchpath(garch11.spec, n.sim=1000)
and are displayed in Figure xxx. The simulated GARCH(1,1) values of R_{t} and \sigma_{t} are quite different from the simulated ARCH(1) values. In particular, the simulated returns R_{t} show fewer extreme values than the ARCH(1) returns, and the simulated volatilities \sigma_{t} appear to show more persistence than the ARCH(1) volatilities. The sample autocorrelations in Figure xxx show the interesting result that R_{t}^{2} exhibits little autocorrelation whereas \sigma_{t}^{2} exhibits substantial positive autocorrelation.
sd(garch11.sim@path$seriesSim)
## [1] 0.963
kurtosis(garch11.sim@path$seriesSim)
## [1] 0.0443
◼
10.3 Maximum Likelihood Estimation
The estimation of the ARCH-GARCH model parameters is more complicated than the estimation of the CER model parameters. There are no simple plug-in principle estimators for the conditional variance parameters. Instead, an alternative estimation method called maximum likelihood (ML) is typically used to estimate the ARCH-GARCH parameters. This section reviews the ML estimation method and shows how it can be applied to estimate the ARCH-GARCH model parameters.
10.3.1 The Likelihood Function
Let X_{1},\ldots,X_{T} be an iid sample with probability density function (pdf) f(x_{t};\theta), where \theta is a (k\times1) vector of parameters that characterize f(x_{t};\theta). For example, if X_{t}\sim N(\mu,\sigma^{2}) then f(x_{t};\theta)=(2\pi\sigma^{2})^{-1/2}\exp\left(-\frac{1}{2\sigma^{2}}(x_{t}-\mu)^{2}\right) and \theta=(\mu,\sigma^{2})^{\prime}. The joint density of the sample is, by independence, equal to the product of the marginal densities:
\begin{equation} f(x_{1},\ldots,x_{T};\theta)=f(x_{1};\theta)\cdots f(x_{T};\theta)=\prod_{t=1}^{T}f(x_{t};\theta).\tag{10.16} \end{equation}
The joint density is a T dimensional function of the data x_{1},\ldots,x_{T} given the parameter vector \theta. The joint density satisfies56 f(x_{1},\ldots,x_{T};\theta)\geq0 and \int\cdots\int f(x_{1},\ldots,x_{T};\theta)dx_{1}\cdots dx_{T}=1.
The likelihood function is defined as the joint density treated as a function of the parameters \theta: L(\theta|x_{1},\ldots,x_{T})=f(x_{1},\ldots,x_{T};\theta)=\prod_{t=1}^{T}f(x_{t};\theta). Notice that the likelihood function is a k dimensional function of \theta given the data x_{1},\ldots,x_{T}. It is important to keep in mind that the likelihood function, being a function of \theta and not the data, is not a proper pdf. It is always positive but \int\cdots\int L(\theta|x_{1},\ldots,x_{T})d\theta_{1}\cdots d\theta_{k}\neq1.
To simplify notation, let the vector \mathbf{x}=(x_{1},\ldots,x_{T})^{\prime} denote the observed sample. Then the joint pdf and likelihood function may be expressed as f(\mathbf{x};\theta) and L(\theta|\mathbf{x}).
Let R_{t} denote the daily return on an asset and assume that \{R_{t}\}_{t=1}^{T} is described by the CER model. Then \{R_{t}\}_{t=1}^{T} is an iid sample with R_{t}\sim N(\mu,\sigma^{2}). The pdf for R_{t} is f(r_{t};\theta)=(2\pi\sigma^{2})^{-1/2}\exp\left(-\frac{1}{2\sigma^{2}}(r_{t}-\mu)^{2}\right), -\infty<\mu<\infty, \sigma^{2}>0, -\infty<r<\infty, so that \theta=(\mu,\sigma^{2})^{\prime}. Then the likelihood function is given by
\begin{equation} L(\theta|\mathbf{r})=\prod_{t=1}^{T}(2\pi\sigma^{2})^{-1/2}\exp\left(-\frac{1}{2\sigma^{2}}(r_{t}-\mu)^{2}\right)=(2\pi\sigma^{2})^{-T/2}\exp\left(-\frac{1}{2\sigma^{2}}\sum_{t=1}^{T}(r_{t}-\mu)^{2}\right),\tag{10.18} \end{equation}
where \mathbf{r}=(r_{1},\ldots,r_{T})^{\prime} denotes the observed sample.
◼
Now suppose \{X_{t}\} is a covariance stationary time series whose joint pdf for a sample of size T is given by f(x_{1},\ldots,x_{T};\theta). Here, the factorization of the likelihood function given in (10.16) does not work because the random variables \{X_{t}\}_{t=1}^{T} generating the sample \{x_{t}\}_{t=1}^{T} are not iid. In this case, to factorize the joint density one can use the conditional-marginal factorization of the joint density f(x_{1},\ldots,x_{T};\theta). To see how this works, consider the joint density of two adjacent observations, f(x_{1},x_{2};\theta). This joint density can be factored as the product of the conditional density of X_{2} given X_{1}=x_{1} and the marginal density of X_{1}: f(x_{1},x_{2};\theta)=f(x_{2}|x_{1};\theta)f(x_{1};\theta). For three observations, the factorization becomes f(x_{1},x_{2},x_{3};\theta)=f(x_{3}|x_{2},x_{1};\theta)f(x_{2}|x_{1};\theta)f(x_{1};\theta). For a sample of size T, the conditional-marginal factorization of the joint pdf f(x_{1},\ldots,x_{T};\theta) has the form f(x_{1},\ldots,x_{T};\theta)=\left(\prod_{t=p+1}^{T}f(x_{t}|I_{t-1};\theta)\right)\cdot f(x_{1},\ldots,x_{p};\theta), where I_{t}=\{x_{t},\ldots,x_{1}\} denotes the conditioning information at time t, f(x_{t}|I_{t-1};\theta) is the pdf of x_{t} conditional on I_{t-1}, and f(x_{1},\ldots,x_{p};\theta) denotes the joint density of the p initial values x_{1},\ldots,x_{p}. Hence, the conditional-marginal factorization of the likelihood function is
\begin{equation} L(\theta|x_{1},\ldots,x_{T})=f(x_{1},\ldots,x_{T};\theta)=\left(\prod_{t=p+1}^{T}f(x_{t}|I_{t-1};\theta)\right)\cdot f(x_{1},\ldots,x_{p};\theta).\tag{10.20} \end{equation}
For many covariance stationary time series models, the conditional pdf f(x_{t}|I_{t-1};\theta) has a simple form whereas the marginal joint pdf f(x_{1},\ldots,x_{p};\theta) is complicated. For these models, the marginal joint pdf is often ignored in (10.20), which gives the simplified conditional likelihood function
\begin{equation} L(\theta|x_{1},\ldots,x_{T})\approx\prod_{t=p+1}^{T}f(x_{t}|I_{t-1};\theta).\tag{10.21} \end{equation}
For reasonably large samples (e.g., T>100) the density of the initial values f(x_{1},\ldots,x_{p};\theta) has a negligible influence on the shape of the likelihood function (10.20) and can be safely ignored. In what follows we will only consider the conditional likelihood function (10.21).
Let R_{t} denote the daily return on an asset and assume that \{R_{t}\}_{t=1}^{T} is described by the GARCH(1,1) model R_{t}=\mu+\epsilon_{t}, \epsilon_{t}=\sigma_{t}z_{t}, \sigma_{t}^{2}=\omega+\alpha_{1}\epsilon_{t-1}^{2}+\beta_{1}\sigma_{t-1}^{2}, z_{t}\sim iid\ N(0,1). From the properties of the GARCH(1,1) model we know that R_{t}|I_{t-1}\sim N(\mu,\sigma_{t}^{2}) and so f(r_{t}|I_{t-1};\theta)=(2\pi\sigma_{t}^{2})^{-1/2}\exp\left(-\frac{1}{2\sigma_{t}^{2}}(r_{t}-\mu)^{2}\right), where \theta=(\mu,\omega,\alpha_{1},\beta_{1})^{\prime}. The conditional likelihood (10.21) is then
\begin{equation} L(\theta|\mathbf{r})=\prod_{t=1}^{T}(2\pi\sigma_{t}^{2})^{-1/2}\exp\left(-\frac{1}{2\sigma_{t}^{2}}(r_{t}-\mu)^{2}\right).\tag{10.23} \end{equation}
In (10.23), the values for \sigma_{t}^{2} are determined recursively from (10.15) starting at t=1:
\begin{equation} \sigma_{1}^{2}=\omega+\alpha_{1}\epsilon_{0}^{2}+\beta_{1}\sigma_{0}^{2}.\tag{10.24} \end{equation}
Given values for the parameters \omega, \alpha_{1}, \beta_{1} and initial values \epsilon_{0}^{2} and \sigma_{0}^{2}, the value for \sigma_{1}^{2} is determined from (10.24). For t=2, we have \sigma_{2}^{2}=\omega+\alpha_{1}\epsilon_{1}^{2}+\beta_{1}\sigma_{1}^{2}=\omega+\alpha_{1}(r_{1}-\mu)^{2}+\beta_{1}\sigma_{1}^{2}, where \sigma_{1}^{2} is determined from (10.24). Hence, the values for \sigma_{t}^{2} in (10.23) are determined recursively using the GARCH(1,1) variance equation (10.15) starting with (10.24).
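The recursion (10.24) together with the likelihood (10.23) translates directly into code. A base-R sketch of the GARCH(1,1) conditional log-likelihood, with the initial values \epsilon_{0}^{2} and \sigma_{0}^{2} supplied as arguments (the defaults below, zero and the sample variance, are common but arbitrary choices):

```r
# GARCH(1,1) conditional log-likelihood, built recursively as in (10.23)-(10.24).
# theta = (mu, omega, alpha1, beta1); eps0 and sigma2.0 are assumed initial values.
garch11.loglik <- function(theta, r, eps0 = 0, sigma2.0 = var(r)) {
  mu <- theta[1]; omega <- theta[2]; alpha1 <- theta[3]; beta1 <- theta[4]
  n <- length(r)
  sigma2 <- numeric(n)
  sigma2[1] <- omega + alpha1 * eps0^2 + beta1 * sigma2.0       # (10.24)
  for (t in 2:n)                                                # variance recursion (10.15)
    sigma2[t] <- omega + alpha1 * (r[t-1] - mu)^2 + beta1 * sigma2[t-1]
  sum(dnorm(r, mean = mu, sd = sqrt(sigma2), log = TRUE))       # log of (10.23)
}
```

For example, garch11.loglik(c(0, 0.1, 0.1, 0.8), r) evaluates the conditional log-likelihood of a return vector r at those (illustrative) parameter values.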
◼
10.3.2 The Maximum Likelihood Estimator
Consider a sample of observations \{x_{t}\}_{t=1}^{T} with joint pdf f(x_{1},\ldots,x_{T};\theta) and likelihood function L(\theta|x_{1},\ldots,x_{T}). Suppose that X_{t} is a discrete random variable so that L(\theta|x_{1},\ldots,x_{T})=f(x_{1},\ldots,x_{T};\theta)=\Pr(X_{1}=x_{1},\ldots,X_{T}=x_{T}). So for a given value of \theta, say \theta=\theta_{0}, L(\theta_{0}|x_{1},\ldots,x_{T}) gives the probability of the observed sample \{x_{t}\}_{t=1}^{T} coming from a model in which \theta_{0} is the true parameter. Now, some values of \theta will lead to higher probabilities of observing \{x_{t}\}_{t=1}^{T} than others. The value of \theta that gives the highest probability for the observed sample \{x_{t}\}_{t=1}^{T} is the value of \theta that maximizes the likelihood function L(\theta|x_{1},\ldots,x_{T}). This value of \theta is called the maximum likelihood estimator (MLE) for \theta. When X_{t} is a continuous random variable, f(x_{1},\ldots,x_{T};\theta) is not a joint probability but represents the height of the joint pdf as a function of \{x_{t}\}_{t=1}^{T} for a given \theta. However, the interpretation of the MLE is similar: it is the value of \theta that makes the observed data most likely.
Formally, the MLE for \theta, denoted \hat{\theta}_{mle}, is the value of \theta that maximizes L(\theta|\mathbf{x}). That is, \hat{\theta}_{mle} solves the optimization problem \max_{\theta}L(\theta|\mathbf{x}).
It is often quite difficult to directly maximize L(\theta|\mathbf{x}). It is usually much easier to maximize the log-likelihood function \ln L(\theta|\mathbf{x}). Since \ln(\cdot) is a monotonically increasing function the value of the \theta that maximizes \ln L(\theta|\mathbf{x}) will also maximize L(\theta|\mathbf{x}). Therefore, we may also define \hat{\theta}_{mle} as the value of \theta that solves \max_{\theta}\ln L(\theta|\mathbf{x}). With random sampling, the log-likelihood is additive in the log of the marginal densities: \ln L(\theta|\mathbf{x})=\ln\left(\prod_{t=1}^{T}f(x_{t};\theta)\right)=\sum_{t=1}^{T}\ln f(x_{t};\theta). For a covariance stationary time series, the conditional log-likelihood is additive in the log of the conditional densities: \ln L(\theta|\mathbf{x})=\ln\left(\prod_{t=1}^{T}f(x_{t}|I_{t-1};\theta)\right)=\sum_{t=1}^{T}\ln f(x_{t}|I_{t-1};\theta).
Since the MLE is defined as a maximization problem, under suitable regularity conditions we can determine the MLE using simple calculus.57 We find the MLE by differentiating \ln L(\theta|\mathbf{x}) and solving the first order conditions \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{x})}{\partial\theta}=\mathbf{0}. Since \theta is (k\times1) the first order conditions define k, potentially nonlinear, equations in k unknown values: \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{x})}{\partial\theta}=\left(\begin{array}{c} \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{x})}{\partial\theta_{1}}\\ \vdots\\ \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{x})}{\partial\theta_{k}} \end{array}\right)=\left(\begin{array}{c} 0\\ \vdots\\ 0 \end{array}\right).
The vector of derivatives of the log-likelihood function is called the score vector and is denoted S(\theta|\mathbf{x})=\frac{\partial\ln L(\theta|\mathbf{x})}{\partial\theta}. By definition, the MLE satisfies S(\hat{\theta}_{mle}|\mathbf{x})=0.
In some cases, it is possible to find analytic solutions to the set of equations S(\hat{\theta}_{mle}|\mathbf{x})=\mathbf{0}. However, for ARCH and GARCH models the set of equations S(\hat{\theta}_{mle}|\mathbf{x})=0 are complicated nonlinear functions of the elements of \theta and no analytic solutions exist. As a result, numerical optimization methods are required to determine \hat{\theta}_{mle}.
In the CER model with likelihood function (10.18), the log-likelihood function is \begin{equation} \ln L(\theta|\mathbf{r})=-\frac{T}{2}\ln(2\pi)-\frac{T}{2}\ln(\sigma^{2})-\frac{1}{2\sigma^{2}}\sum_{t=1}^{T}(r_{t}-\mu)^{2},\tag{10.25} \end{equation} and we may determine the MLE for \theta=(\mu,\sigma^{2})^{\prime} by maximizing the log-likelihood (10.25). The sample score is a (2\times1) vector given by S(\theta|\mathbf{r})=\left(\begin{array}{c} \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\mu}\\ \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\sigma^{2}} \end{array}\right), where \begin{align*} \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\mu} & =\frac{1}{\sigma^{2}}\sum_{t=1}^{T}(r_{t}-\mu),\\ \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\sigma^{2}} & =-\frac{T}{2}(\sigma^{2})^{-1}+\frac{1}{2}(\sigma^{2})^{-2}\sum_{t=1}^{T}(r_{t}-\mu)^{2}. \end{align*}
Solving S(\hat{\theta}_{mle}|\mathbf{r})=0 gives the first order conditions \begin{align*} \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{r})}{\partial\mu} & =\frac{1}{\hat{\sigma}_{mle}^{2}}\sum_{t=1}^{T}(r_{t}-\hat{\mu}_{mle})=0,\\ \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{r})}{\partial\sigma^{2}} & =-\frac{T}{2}(\hat{\sigma}_{mle}^{2})^{-1}+\frac{1}{2}(\hat{\sigma}_{mle}^{2})^{-2}\sum_{t=1}^{T}(r_{t}-\hat{\mu}_{mle})^{2}=0. \end{align*} Solving the first equation for \hat{\mu}_{mle} gives \begin{equation} \hat{\mu}_{mle}=\frac{1}{T}\sum_{t=1}^{T}r_{t}=\bar{r}.\tag{10.26} \end{equation} Hence, the sample average is the MLE for \mu. Using \hat{\mu}_{mle}=\bar{r} and solving the second equation for \hat{\sigma}_{mle}^{2} gives \begin{equation} \hat{\sigma}_{mle}^{2}=\frac{1}{T}\sum_{t=1}^{T}(r_{t}-\bar{r})^{2}.\tag{10.27} \end{equation} Notice that \hat{\sigma}_{mle}^{2} is not equal to the sample variance, which uses a divisor of T-1 instead of T.
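The closed-form solutions (10.26) and (10.27) are easy to verify numerically (the simulated series below is an illustrative stand-in for return data):

```r
# The CER MLEs (10.26)-(10.27) vs R's built-in estimators
set.seed(42)
r <- rnorm(1000, mean = 0.05, sd = 0.10)   # illustrative iid "returns"
n <- length(r)
mu.mle     <- mean(r)                       # (10.26): the sample mean
sigma2.mle <- mean((r - mu.mle)^2)          # (10.27): divisor n, not n-1
all.equal(sigma2.mle, (n - 1)/n * var(r))   # TRUE: MLE = (n-1)/n times sample variance
```

The final line confirms the relationship noted above: the MLE of \sigma^{2} equals (T-1)/T times the usual sample variance.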
It is instructive to compare the MLEs for \mu and \sigma^{2} with the CER model estimates presented in Chapter 7. Recall, the CER model estimators for \mu and \sigma^{2} are motivated by the plug-in principle and are equal to the sample mean and variance, respectively. Here, we see that the MLE for \mu is equal to the sample mean and the MLE for \sigma^{2} is (T-1)/T times the sample variance. In fact, it can be shown that the CER model estimates for the other parameters are also equal to or approximately equal to the MLEs.
\blacksquare
In the GARCH(1,1) model with likelihood function (10.23) and parameter vector \theta=(\mu,\omega,\alpha_{1},\beta_{1})^{\prime}, the log-likelihood function is \begin{equation} \ln L(\theta|\mathbf{r})=-\frac{T}{2}\ln(2\pi)-\frac{1}{2}\sum_{t=1}^{T}\ln(\sigma_{t}^{2}(\theta))-\frac{1}{2}\sum_{t=1}^{T}\frac{(r_{t}-\mu)^{2}}{\sigma_{t}^{2}(\theta)}.\tag{10.28} \end{equation} In (10.28), it is instructive to write \sigma_{t}^{2}=\sigma_{t}^{2}(\theta) to emphasize that \sigma_{t}^{2} is a function of \theta. That is, from (10.15), \sigma_{t}^{2}=\sigma_{t}^{2}(\theta)=\omega+\alpha_{1}\epsilon_{t-1}^{2}+\beta_{1}\sigma_{t-1}^{2}(\theta)=\omega+\alpha_{1}(r_{t-1}-\mu)^{2}+\beta_{1}\sigma_{t-1}^{2}(\theta). The sample score is a (4\times1) vector given by S(\theta|\mathbf{r})=\left(\begin{array}{c} \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\mu}\\ \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\omega}\\ \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\alpha_{1}}\\ \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\beta_{1}} \end{array}\right). The elements of S(\theta|\mathbf{r}), unfortunately, do not have simple closed form expressions and no analytic formulas are available for the MLEs of the elements of \theta. As a result, numerical methods, as discussed in sub-section 10.3.6 below, are required to determine the value \hat{\theta}_{mle} that maximizes (10.28) and solves S(\hat{\theta}_{mle}|\mathbf{r})=0.
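In practice, this numerical maximization is delegated to software. With rugarch, ugarchfit() maximizes the GARCH(1,1) log-likelihood numerically; the sketch below fits the model to data simulated from a known GARCH(1,1) (an illustrative stand-in for an actual daily return series):

```r
library(rugarch)
# Simulate data from a known GARCH(1,1) (illustrative parameter values)
true.spec = ugarchspec(variance.model = list(garchOrder = c(1,1)),
                       mean.model = list(armaOrder = c(0,0)),
                       fixed.pars = list(mu = 0, omega = 0.1,
                                         alpha1 = 0.1, beta1 = 0.8))
set.seed(123)
r = ugarchpath(true.spec, n.sim = 2000)@path$seriesSim
# Same spec with free parameters: ugarchfit() maximizes the log-likelihood numerically
garch11.spec = ugarchspec(variance.model = list(garchOrder = c(1,1)),
                          mean.model = list(armaOrder = c(0,0)))
garch11.fit = ugarchfit(spec = garch11.spec, data = r)
coef(garch11.fit)   # estimates of mu, omega, alpha1, beta1
```

The estimates should be in the neighborhood of the values used to simulate the data, up to sampling error.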
\blacksquare
10.3.3 Invariance Property of Maximum Likelihood Estimators
One of the attractive features of the method of maximum likelihood is its invariance to one-to-one transformations of the parameters of the log-likelihood. That is, if \hat{\theta}_{mle} is the MLE of \theta and \alpha=h(\theta) is a one-to-one function of \theta then \hat{\alpha}_{mle}=h(\hat{\theta}_{mle}) is the MLE for \alpha.
In the CER model, the log-likelihood is parametrized in terms of \mu and \sigma^{2} and we have the MLEs (10.26) and (10.27). Suppose we are interested in the MLE for \sigma=h(\sigma^{2})=(\sigma^{2})^{1/2}, which is a one-to-one function for \sigma^{2}>0. The invariance property of MLEs gives the result \hat{\sigma}_{mle}=(\hat{\sigma}_{mle}^{2})^{1/2}=\left(\frac{1}{T}\sum_{t=1}^{T}(r_{t}-\hat{\mu}_{mle})^{2}\right)^{1/2}.
10.3.4 The Precision of the Maximum Likelihood Estimator
The likelihood, log-likelihood and score functions for a typical model are illustrated in figure xxx. The likelihood function is always positive (since it is the joint density of the sample) but the log-likelihood function is typically negative (being the log of a number less than 1). Here the log-likelihood is globally concave and has a unique maximum at \hat{\theta}_{mle}. Consequently, the score function is positive to the left of the maximum, crosses zero at the maximum and becomes negative to the right of the maximum.
Intuitively, the precision of \hat{\theta}_{mle} depends on the curvature of the log-likelihood function near \hat{\theta}_{mle}. If the log-likelihood is very curved or “steep” around \hat{\theta}_{mle}, then \theta will be precisely estimated. In this case, we say that we have a lot of information about \theta. On the other hand, if the log-likelihood is not curved or “flat” near \hat{\theta}_{mle}, then \theta will not be precisely estimated. Accordingly, we say that we do not have much information about \theta.
The extreme case of a completely flat likelihood in \theta is illustrated in figure xxx. Here, the sample contains no information about the true value of \theta because every value of \theta produces the same value of the likelihood function. When this happens we say that \theta is not identified. Formally, \theta is identified if for all \theta_{1}\neq\theta_{2} there exists a sample \mathbf{x} for which L(\theta_{1}|\mathbf{x})\neq L(\theta_{2}|\mathbf{x}).
The curvature of the log-likelihood is measured by its second derivative (Hessian) H(\theta|\mathbf{x})=\frac{\partial^{2}\ln L(\theta|\mathbf{x})}{\partial\theta\partial\theta^{\prime}}. Since the Hessian is negative semi-definite, the information in the sample about \theta may be measured by -H(\theta|\mathbf{x}). If \theta is a scalar then -H(\theta|\mathbf{x}) is a positive number. The expected amount of information in the sample about the parameter \theta is the information matrix I(\theta|\mathbf{x})=-E[H(\theta|\mathbf{x})]. As we shall see in the next sub-section, the information matrix is directly related to the precision of the MLE.
10.3.5 Asymptotic Properties of Maximum Likelihood Estimators
Let \{X_{t}\} be a covariance stationary time series with conditional probability density function (pdf) f(x_{t}|I_{t-1};\theta), where \theta is a (k\times1) vector of parameters that characterize f(x_{t}|I_{t-1};\theta). Under suitable regularity conditions, the ML estimator of \theta has the following asymptotic properties as the sample size n\rightarrow\infty:
- \hat{\theta}_{mle}\overset{p}{\rightarrow}\theta
- \sqrt{n}(\hat{\theta}_{mle}-\theta)\overset{d}{\rightarrow}N(0,I(\theta|x_{t})^{-1}), where I(\theta|x_{t})=-E\left[H(\theta|x_{t})\right]=-E\left[\frac{\partial^{2}\ln f(\theta|x_{t})}{\partial\theta\partial\theta^{\prime}}\right] That is, \mathrm{avar}(\sqrt{n}(\hat{\theta}_{mle}-\theta))=I(\theta|x_{t})^{-1} Alternatively, \hat{\theta}_{mle}\sim N\left(\theta,\frac{1}{n}I(\theta|x_{t})^{-1}\right)=N(\theta,I(\theta|\mathbf{x})^{-1}) where I(\theta|\mathbf{x})=nI(\theta|x_{t})= information matrix for the sample.
- \hat{\theta}_{mle} is efficient in the class of consistent and asymptotically normal estimators. That is, \mathrm{avar}(\sqrt{n}(\hat{\theta}_{mle}-\theta))-\mathrm{avar}(\sqrt{n}(\tilde{\theta}-\theta))\leq0 for any consistent and asymptotically normal estimator \tilde{\theta}.
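A quick simulation sketch (added here for illustration; the sample sizes, seed, and parameter values are arbitrary choices) makes the second property concrete: for i.i.d. N(\mu,\sigma^{2}) data with \sigma known, I(\mu)=1/\sigma^{2}, so \sqrt{n}(\hat{\mu}_{mle}-\mu) should be approximately N(0,\sigma^{2}).

```python
# Monte Carlo check of sqrt(n)-asymptotic normality of the MLE of mu.
# Here avar = I(mu)^{-1} = sigma^2, so the scaled estimation errors
# should have standard deviation close to sigma. Illustrative values.
import math
import random

random.seed(42)
mu_true, sigma = 5.0, 2.0
n, reps = 200, 2000

scaled = []
for _ in range(reps):
    xs = [random.gauss(mu_true, sigma) for _ in range(n)]
    mu_hat = sum(xs) / n                      # MLE of mu
    scaled.append(math.sqrt(n) * (mu_hat - mu_true))

mean_s = sum(scaled) / reps
sd_s = math.sqrt(sum((s - mean_s)**2 for s in scaled) / (reps - 1))
# mean_s should be near 0 and sd_s near sigma = 2
```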
10.3.6 Computing the MLE Using Numerical Optimization Methods
Computation: Newton-Raphson Iteration
Goal: Use an iterative scheme to compute \hat{\theta}=\arg\max_{\theta}\text{ }\ln L(\theta|\mathbf{x}) Numerical optimization methods generally have the following structure:
- Determine a set of starting values
- Update the starting values in a direction that increases \ln L(\theta|\mathbf{x})
- Check to see if the update satisfies the FOCs. If not, update again.
- Stop updating when the FOCs are satisfied.
The most common method for numerically maximizing the log-likelihood is the Newton-Raphson algorithm. This method can be motivated as follows. Consider a second order Taylor series expansion of \ln L(\theta|\mathbf{x}) about the starting value \hat{\theta}_{1} \begin{align} \ln L(\theta|\mathbf{x}) & =\ln L(\hat{\theta}_{1}|\mathbf{x})+\frac{\partial\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta^{\prime}}(\theta-\hat{\theta}_{1})\tag{10.29}\\ & +\frac{1}{2}(\theta-\hat{\theta}_{1})^{\prime}\frac{\partial^{2}\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta\partial\theta^{\prime}}(\theta-\hat{\theta}_{1})+error.\nonumber \end{align} Now maximize (10.29) with respect to \theta. The FOCs are \underset{p\times1}{\mathbf{0}}=\frac{\partial\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta}+\frac{\partial^{2}\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta\partial\theta^{\prime}}(\hat{\theta}_{2}-\hat{\theta}_{1}),
which can be solved for \hat{\theta}_{2}: \begin{align*} \hat{\theta}_{2} & =\hat{\theta}_{1}-\left[\frac{\partial^{2}\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta\partial\theta^{\prime}}\right]^{-1}\frac{\partial\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta}\\ & =\hat{\theta}_{1}-H(\hat{\theta}_{1}|\mathbf{x})^{-1}S(\hat{\theta}_{1}|\mathbf{x}) \end{align*} This suggests the iterative scheme \hat{\theta}_{n+1}=\hat{\theta}_{n}-H(\hat{\theta}_{n}|\mathbf{x})^{-1}S(\hat{\theta}_{n}|\mathbf{x}) Iteration stops when S(\hat{\theta}_{n}|\mathbf{x})\approx0.
- Practical considerations: Stopping criterion
- Practical considerations: analytic or numerical derivatives
- Practical considerations: NR gives the Hessian H(\hat{\theta}_{n}|\mathbf{x}) as a by-product of the optimization, and -H(\hat{\theta}_{n}|\mathbf{x})^{-1} is used to compute the asymptotic standard errors of the MLE
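The iterative scheme above can be sketched in a few lines. The example below (an illustration added here, not from the text) applies Newton-Raphson to an i.i.d. exponential(\lambda) sample, chosen purely because its score and Hessian are simple and the MLE has the closed form \hat{\lambda}=1/\bar{x} to check against.

```python
# Newton-Raphson MLE sketch for an exponential(lambda) sample.
# Score:   S(lambda) = n/lambda - sum(x)
# Hessian: H(lambda) = -n/lambda^2
# Illustrative data; the closed-form MLE is 1/mean(x).
import random

random.seed(7)
lam_true = 2.0
x = [random.expovariate(lam_true) for _ in range(1000)]
n, sx = len(x), sum(x)

def score(lam):
    return n / lam - sx

def hessian(lam):
    return -n / lam**2

lam = 1.0                 # starting value
for _ in range(50):       # NR update: theta_{k+1} = theta_k - H^{-1} S
    lam = lam - score(lam) / hessian(lam)
    if abs(score(lam)) < 1e-10:   # stop when the FOC is (nearly) satisfied
        break

# asymptotic standard error from -H^{-1}, the NR by-product
se = (-1.0 / hessian(lam)) ** 0.5
```

Because the log-likelihood here is globally concave in \lambda, the iteration converges quadratically to \hat{\lambda}_{mle}=n/\sum x_{i} from any reasonable starting value.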
10.5 Forecasting Conditional Volatility from ARCH Models
An important task of modeling conditional volatility is to generate accurate forecasts of both the future value of a financial time series and its conditional volatility. Volatility forecasts are used for risk management, option pricing, portfolio allocation, trading strategies and model evaluation. ARCH and GARCH models can generate accurate forecasts of future daily return volatility, especially over short horizons, and these forecasts eventually converge to the unconditional volatility of daily returns. This section illustrates how to forecast volatility using the GARCH(1,1) model.
10.5.1 Forecasting daily return volatility from the GARCH(1,1) model
For simplicity, consider the basic GARCH(1,1) model \begin{eqnarray*} R_{t} & = & \mu+\epsilon_{t}\\ \epsilon_{t} & = & \sigma_{t}z_{t}\\ \sigma_{t}^{2} & = & \omega+\alpha_{1}\epsilon_{t-1}^{2}+\beta_{1}\sigma_{t-1}^{2}\\ z_{t} & \sim & iid\,N(0,1) \end{eqnarray*} Assume the sample period is t=1,2,\ldots,T and forecasts for volatility are desired for the out-of-sample periods T+1,T+2,\ldots,T+k. For ease of notation, assume the GARCH(1,1) parameters are known.58 The optimal, in terms of mean-squared error, forecast of \sigma_{T+k}^{2} given information at time T is E[\sigma_{T+k}^{2}|I_{T}] and can be computed using a simple recursion known as the chain-rule of forecasting. For k=1, \begin{align} E[\sigma_{T+1}^{2}|I_{T}] & =\omega+\alpha_{1}E[\varepsilon_{T}^{2}|I_{T}]+\beta_{1}E[\sigma_{T}^{2}|I_{T}]\tag{10.30}\\ \quad & =\omega+\alpha_{1}\varepsilon_{T}^{2}+\beta_{1}\sigma_{T}^{2},\nonumber \end{align} where it is assumed that \varepsilon_{T}^{2} and \sigma_{T}^{2} are known at time T. Similarly, for k=2 \begin{align} E[\sigma_{T+2}^{2}|I_{T}] & =\omega+\alpha_{1}E[\varepsilon_{T+1}^{2}|I_{T}]+\beta_{1}E[\sigma_{T+1}^{2}|I_{T}].\tag{10.31} \end{align} Now, in the GARCH(1,1) model, E[\varepsilon_{T+1}^{2}|I_{T}]=E[z_{T+1}^{2}\sigma_{T+1}^{2}|I_{T}]=E[\sigma_{T+1}^{2}|I_{T}] so that (10.31) becomes \begin{eqnarray*} E[\sigma_{T+2}^{2}|I_{T}] & = & \omega+(\alpha_{1}+\beta_{1})E[\sigma_{T+1}^{2}|I_{T}]\\ & = & \omega+(\alpha_{1}+\beta_{1})\left(\omega+\alpha_{1}\varepsilon_{T}^{2}+\beta_{1}\sigma_{T}^{2}\right) \end{eqnarray*} In general, for k\geq2 we have \begin{align} E[\sigma_{T+k}^{2}|I_{T}] & =\omega+(\alpha_{1}+\beta_{1})E[\sigma_{T+k-1}^{2}|I_{T}]\tag{10.32} \end{align} which, by recursive substitution, reduces to E[\sigma_{T+k}^{2}|I_{T}]=\omega\sum_{i=0}^{k-1}(\alpha_{1}+\beta_{1})^{i}+(\alpha_{1}+\beta_{1})^{k-1}(\alpha_{1}\varepsilon_{T}^{2}+\beta_{1}\sigma_{T}^{2}). 
Now, since 0<\alpha_{1}+\beta_{1}<1, as k\rightarrow\infty \begin{eqnarray*} \sum_{i=0}^{k-1}(\alpha_{1}+\beta_{1})^{i} & \rightarrow & \frac{1}{1-(\alpha_{1}+\beta_{1})},\\ (\alpha_{1}+\beta_{1})^{k-1} & \rightarrow & 0, \end{eqnarray*} and so E[\sigma_{T+k}^{2}|I_{T}]\rightarrow\frac{\omega}{1-(\alpha_{1}+\beta_{1})}=\bar{\sigma}^{2}=\mathrm{var}(R_{t}). Notice that the speed at which E[\sigma_{T+k}^{2}|I_{T}] approaches \bar{\sigma}^{2} is captured by the GARCH(1,1) persistence \alpha_{1}+\beta_{1}.
An alternative representation of the forecasting equation (10.32) starts with the mean-adjusted form \sigma_{T+1}^{2}-\bar{\sigma}^{2}=\alpha_{1}(\varepsilon_{T}^{2}-\bar{\sigma}^{2})+\beta_{1}(\sigma_{T}^{2}-\bar{\sigma}^{2}), where \bar{\sigma}^{2}=\omega/(1-\alpha_{1}-\beta_{1}) is the unconditional variance. Then by recursive substitution E[\sigma_{T+k}^{2}|I_{T}]-\bar{\sigma}^{2}=(\alpha_{1}+\beta_{1})^{k-1}(E[\sigma_{T+1}^{2}|I_{T}]-\bar{\sigma}^{2}).
The forecasting algorithm (10.32) produces unbiased forecasts for the conditional variance \sigma_{T+k}^{2}. The forecast for the conditional volatility, \sigma_{T+k}, is computed as the square root of the forecast for \sigma_{T+k}^{2} which is not unbiased because E[\sigma_{T+k}|I_{T}]\neq\sqrt{E[\sigma_{T+k}^{2}|I_{T}]}.
To be completed
\blacksquare
References
Conrad, C., and B. R. Haag. 2006. Inequality Constraints in the Fractionally Integrated GARCH Model. Journal of Financial Econometrics. Vol. 4.
Nelson, D. B., and Charles Q. Cao. 1992. Inequality Constraints in the Univariate GARCH Model. Journal of Business & Economic Statistics. Vol. 10.
More details on using the rugarch functions are given later in the chapter.↩︎
In R there are two main object types: S3 and S4. The main operational difference between S3 and S4 objects is that S4 components are extracted using the @ character instead of the $ character. Also, the names of S4 object slots are extracted using the slotNames() function instead of the names() function.↩︎
Positive coefficients are sufficient but not necessary conditions for the positivity of the conditional variance. See (Nelson and Cao 1992) and (Conrad and Haag 2006) for more general conditions.↩︎
If X_{1},\ldots,X_{T} are discrete random variables, then f(x_{1},\ldots,x_{T};\theta)=\Pr(X_{1}=x_{1},\ldots,X_{T}=x_{T}) for a fixed value of \theta.↩︎
Standard regularity conditions are: (1) the support of the random variables X,\,S_{X}=\{x:f(x;\theta)>0\}, does not depend on \theta; (2) f(x;\theta) is at least three times differentiable with respect to \theta; (3) the true value of \theta lies in a compact set \Theta. ↩︎
In practice, the GARCH forecasting algorithm will use the GARCH parameters estimated over the sample period.↩︎