10.3 Maximum Likelihood Estimation
The estimation of the ARCH-GARCH model parameters is more complicated than the estimation of the CER model parameters. There are no simple plug-in principle estimators for the conditional variance parameters. Instead, an alternative estimation method called maximum likelihood (ML) is typically used to estimate the ARCH-GARCH parameters. This section reviews the ML estimation method and shows how it can be applied to estimate the ARCH-GARCH model parameters.
10.3.1 The Likelihood Function
Let X_{1},\ldots,X_{T} be an iid sample with probability density function (pdf) f(x_{t};\theta), where \theta is a (k\times1) vector of parameters that characterize f(x_{t};\theta). For example, if X_{t}\sim N(\mu,\sigma^{2}) then f(x_{t};\theta)=(2\pi\sigma^{2})^{-1/2}\exp\left(-\frac{1}{2\sigma^{2}}(x_{t}-\mu)^{2}\right) and \theta=(\mu,\sigma^{2})^{\prime}. The joint density of the sample is, by independence, equal to the product of the marginal densities \[ f(x_{1},\ldots,x_{T};\theta)=f(x_{1};\theta)\cdots f(x_{T};\theta)=\prod_{t=1}^{T}f(x_{t};\theta). \] The joint density is a T dimensional function of the data x_{1},\ldots,x_{T} given the parameter vector \theta. The joint density satisfies56 \begin{align*} f(x_{1},\ldots,x_{T};\theta) & \geq0,\\ \int\cdots\int f(x_{1},\ldots,x_{T};\theta)\,dx_{1}\cdots dx_{T} & =1. \end{align*}
The likelihood function is defined as the joint density treated as a function of the parameters \theta: \begin{equation} L(\theta|x_{1},\ldots,x_{T})=f(x_{1},\ldots,x_{T};\theta)=\prod_{t=1}^{T}f(x_{t};\theta).\tag{10.16} \end{equation} Notice that the likelihood function is a k dimensional function of \theta given the data x_{1},\ldots,x_{T}. It is important to keep in mind that the likelihood function, being a function of \theta and not the data, is not a proper pdf. It is always positive but \[ \int\cdots\int L(\theta|x_{1},\ldots,x_{T})\,d\theta_{1}\cdots d\theta_{k}\neq1. \]
To simplify notation, let the vector x=(x1,…,xT)′ denote the observed sample. Then the joint pdf and likelihood function may be expressed as f(x;θ) and L(θ|x).
Let R_{t} denote the daily return on an asset and assume that \{R_{t}\}_{t=1}^{T} is described by the CER model. Then \{R_{t}\}_{t=1}^{T} is an iid sample with R_{t}\sim N(\mu,\sigma^{2}). The pdf of R_{t} is \[ f(r_{t};\theta)=(2\pi\sigma^{2})^{-1/2}\exp\left(-\frac{1}{2\sigma^{2}}(r_{t}-\mu)^{2}\right),\quad-\infty<\mu<\infty,~\sigma^{2}>0,~-\infty<r_{t}<\infty, \] so that \theta=(\mu,\sigma^{2})^{\prime}. The likelihood function is then \begin{equation} L(\theta|\mathbf{r})=\prod_{t=1}^{T}(2\pi\sigma^{2})^{-1/2}\exp\left(-\frac{1}{2\sigma^{2}}(r_{t}-\mu)^{2}\right)=(2\pi\sigma^{2})^{-T/2}\exp\left(-\frac{1}{2\sigma^{2}}\sum_{t=1}^{T}(r_{t}-\mu)^{2}\right),\tag{10.18} \end{equation} where \mathbf{r}=(r_{1},\ldots,r_{T})^{\prime} denotes the observed sample.
\blacksquare
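To make the likelihood function concrete, the sketch below evaluates the CER likelihood numerically in Python. The function name `cer_likelihood` and the five return values are purely illustrative, not from the text.

```python
import numpy as np

def cer_likelihood(mu, sigma2, r):
    """CER likelihood L(theta|r): the product of N(mu, sigma2) densities."""
    return np.prod((2 * np.pi * sigma2) ** -0.5
                   * np.exp(-0.5 * (r - mu) ** 2 / sigma2))

# Hypothetical sample of daily returns
r = np.array([0.01, -0.02, 0.005, 0.015, -0.01])

# The likelihood is larger at parameter values close to the sample moments
# than at values far from them.
print(cer_likelihood(0.0, 0.0002, r) > cer_likelihood(0.05, 0.0002, r))
```

For large T the product of densities underflows in floating point, which is one practical reason the log-likelihood is maximized instead.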
Now suppose \{X_{t}\} is a covariance stationary time series whose joint pdf for a sample of size T is f(x_{1},\ldots,x_{T};\theta). Here, the iid factorization of the likelihood function given in (10.16) does not work because the random variables \{X_{t}\}_{t=1}^{T} generating the sample \{x_{t}\}_{t=1}^{T} are not iid. In this case, one can instead use the conditional-marginal factorization of the joint density f(x_{1},\ldots,x_{T};\theta). To see how this works, consider the joint density of two adjacent observations, f(x_{1},x_{2};\theta). This joint density can be factored as the product of the conditional density of X_{2} given X_{1}=x_{1} and the marginal density of X_{1}: \[ f(x_{1},x_{2};\theta)=f(x_{2}|x_{1};\theta)f(x_{1};\theta). \] For three observations, the factorization becomes \[ f(x_{1},x_{2},x_{3};\theta)=f(x_{3}|x_{2},x_{1};\theta)f(x_{2}|x_{1};\theta)f(x_{1};\theta). \] For a sample of size T, the conditional-marginal factorization of the joint pdf f(x_{1},\ldots,x_{T};\theta) has the form \[ f(x_{1},\ldots,x_{T};\theta)=\left(\prod_{t=p+1}^{T}f(x_{t}|I_{t-1};\theta)\right)\cdot f(x_{1},\ldots,x_{p};\theta), \] where I_{t}=\{x_{t},\ldots,x_{1}\} denotes the conditioning information available at time t, f(x_{t}|I_{t-1};\theta) is the pdf of x_{t} conditional on I_{t-1}, and f(x_{1},\ldots,x_{p};\theta) denotes the joint density of the p initial values x_{1},\ldots,x_{p}. Hence, the conditional-marginal factorization of the likelihood function is \begin{equation} L(\theta|x_{1},\ldots,x_{T})=f(x_{1},\ldots,x_{T};\theta)=\left(\prod_{t=p+1}^{T}f(x_{t}|I_{t-1};\theta)\right)\cdot f(x_{1},\ldots,x_{p};\theta).\tag{10.20} \end{equation} For many covariance stationary time series models, the conditional pdf f(x_{t}|I_{t-1};\theta) has a simple form whereas the marginal joint pdf f(x_{1},\ldots,x_{p};\theta) is complicated. For these models, the marginal joint pdf is often dropped from (10.20), which gives the simplified conditional likelihood function \begin{equation} L(\theta|x_{1},\ldots,x_{T})\approx\prod_{t=p+1}^{T}f(x_{t}|I_{t-1};\theta).\tag{10.21} \end{equation} For reasonably large samples (e.g., T>100), the density of the initial values f(x_{1},\ldots,x_{p};\theta) has a negligible influence on the shape of the likelihood function (10.20) and can be safely ignored. In what follows we consider only the conditional likelihood function (10.21).
Let R_{t} denote the daily return on an asset and assume that \{R_{t}\}_{t=1}^{T} is described by the GARCH(1,1) model \begin{align*} R_{t} & =\mu+\epsilon_{t},\\ \epsilon_{t} & =\sigma_{t}z_{t},\\ \sigma_{t}^{2} & =\omega+\alpha_{1}\epsilon_{t-1}^{2}+\beta_{1}\sigma_{t-1}^{2},\\ z_{t} & \sim\mathrm{iid}~N(0,1). \end{align*} From the properties of the GARCH(1,1) model we know that R_{t}|I_{t-1}\sim N(\mu,\sigma_{t}^{2}) and so \[ f(r_{t}|I_{t-1};\theta)=(2\pi\sigma_{t}^{2})^{-1/2}\exp\left(-\frac{1}{2\sigma_{t}^{2}}(r_{t}-\mu)^{2}\right), \] where \theta=(\mu,\omega,\alpha_{1},\beta_{1})^{\prime}. The conditional likelihood (10.21) is then \begin{equation} L(\theta|\mathbf{r})=\prod_{t=1}^{T}(2\pi\sigma_{t}^{2})^{-1/2}\exp\left(-\frac{1}{2\sigma_{t}^{2}}(r_{t}-\mu)^{2}\right).\tag{10.23} \end{equation} In (10.23), the values for \sigma_{t}^{2} are determined recursively from the variance equation (10.15) starting at t=1: \begin{equation} \sigma_{1}^{2}=\omega+\alpha_{1}\epsilon_{0}^{2}+\beta_{1}\sigma_{0}^{2}.\tag{10.24} \end{equation} Given values for the parameters \omega, \alpha_{1}, \beta_{1} and initial values \epsilon_{0}^{2} and \sigma_{0}^{2}, the value of \sigma_{1}^{2} is determined from (10.24). For t=2, we have \[ \sigma_{2}^{2}=\omega+\alpha_{1}\epsilon_{1}^{2}+\beta_{1}\sigma_{1}^{2}=\omega+\alpha_{1}(r_{1}-\mu)^{2}+\beta_{1}\sigma_{1}^{2}, \] where \sigma_{1}^{2} is determined from (10.24). Hence, the values of \sigma_{t}^{2} in (10.23) are determined recursively using the GARCH(1,1) variance equation (10.15) starting with (10.24).
\blacksquare
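A sketch of this recursion in Python follows. The choice of initial values \epsilon_{0}^{2}=\sigma_{0}^{2}=\widehat{\mathrm{var}}(r) is a common convention but is an assumption here, not something fixed by the model; the function names are illustrative.

```python
import numpy as np

def garch11_sigma2(r, mu, omega, alpha1, beta1):
    """Build sigma_t^2 recursively from the GARCH(1,1) variance equation."""
    eps = r - mu
    sigma2 = np.empty(len(r))
    eps0_sq = sigma0_sq = np.var(r)  # initial values: an assumed convention
    sigma2[0] = omega + alpha1 * eps0_sq + beta1 * sigma0_sq
    for t in range(1, len(r)):
        sigma2[t] = omega + alpha1 * eps[t - 1] ** 2 + beta1 * sigma2[t - 1]
    return sigma2

def garch11_loglik(theta, r):
    """Conditional log-likelihood implied by the likelihood function (10.23)."""
    mu, omega, alpha1, beta1 = theta
    s2 = garch11_sigma2(r, mu, omega, alpha1, beta1)
    return np.sum(-0.5 * np.log(2 * np.pi * s2) - 0.5 * (r - mu) ** 2 / s2)
```

Note that the whole path of conditional variances, and hence the likelihood, must be recomputed every time \theta changes, which is what a numerical optimizer does at each trial value of \theta.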
10.3.2 The Maximum Likelihood Estimator
Consider a sample of observations \{x_{t}\}_{t=1}^{T} with joint pdf f(x_{1},\ldots,x_{T};\theta) and likelihood function L(\theta|x_{1},\ldots,x_{T}). Suppose that X_{t} is a discrete random variable so that \[ L(\theta|x_{1},\ldots,x_{T})=f(x_{1},\ldots,x_{T};\theta)=\Pr(X_{1}=x_{1},\ldots,X_{T}=x_{T}). \] So for a given value of \theta, say \theta=\theta_{0}, L(\theta_{0}|x_{1},\ldots,x_{T}) gives the probability of the observed sample \{x_{t}\}_{t=1}^{T} coming from a model in which \theta_{0} is the true parameter. Now, some values of \theta will lead to higher probabilities of observing \{x_{t}\}_{t=1}^{T} than others. The value of \theta that gives the highest probability for the observed sample \{x_{t}\}_{t=1}^{T} is the value of \theta that maximizes the likelihood function L(\theta|x_{1},\ldots,x_{T}). This value of \theta is called the maximum likelihood estimator (MLE) for \theta. When X_{t} is a continuous random variable, f(x_{1},\ldots,x_{T};\theta) is not a joint probability but represents the height of the joint pdf as a function of \{x_{t}\}_{t=1}^{T} for a given \theta. However, the interpretation of the MLE is similar: it is the value of \theta that makes the observed data most likely.
Formally, the MLE for \theta, denoted \hat{\theta}_{mle}, is the value of \theta that maximizes L(\theta|\mathbf{x}). That is, \hat{\theta}_{mle} solves the optimization problem \[ \max_{\theta}\,L(\theta|\mathbf{x}). \]
It is often quite difficult to directly maximize L(\theta|\mathbf{x}). It is usually much easier to maximize the log-likelihood function \ln L(\theta|\mathbf{x}). Since \ln(\cdot) is a monotonically increasing function, the value of \theta that maximizes \ln L(\theta|\mathbf{x}) will also maximize L(\theta|\mathbf{x}). Therefore, we may also define \hat{\theta}_{mle} as the value of \theta that solves \max_{\theta}\ln L(\theta|\mathbf{x}). With random sampling, the log-likelihood is additive in the log of the marginal densities: \[ \ln L(\theta|\mathbf{x})=\ln\left(\prod_{t=1}^{T}f(x_{t};\theta)\right)=\sum_{t=1}^{T}\ln f(x_{t};\theta). \] For a covariance stationary time series, the conditional log-likelihood is additive in the log of the conditional densities: \[ \ln L(\theta|\mathbf{x})=\ln\left(\prod_{t=1}^{T}f(x_{t}|I_{t-1};\theta)\right)=\sum_{t=1}^{T}\ln f(x_{t}|I_{t-1};\theta). \]
Since the MLE is defined as a maximization problem, under suitable regularity conditions we can determine the MLE using simple calculus.57 We find the MLE by differentiating \ln L(\theta|\mathbf{x}) and solving the first order conditions \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{x})}{\partial\theta}=\mathbf{0}. Since \theta is (k\times1) the first order conditions define k, potentially nonlinear, equations in k unknown values: \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{x})}{\partial\theta}=\left(\begin{array}{c} \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{x})}{\partial\theta_{1}}\\ \vdots\\ \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{x})}{\partial\theta_{k}} \end{array}\right)=\left(\begin{array}{c} 0\\ \vdots\\ 0 \end{array}\right).
The vector of derivatives of the log-likelihood function is called the score vector and is denoted S(\theta|\mathbf{x})=\frac{\partial\ln L(\theta|\mathbf{x})}{\partial\theta}. By definition, the MLE satisfies S(\hat{\theta}_{mle}|\mathbf{x})=0.
In some cases, it is possible to find analytic solutions to the set of equations S(\hat{\theta}_{mle}|\mathbf{x})=\mathbf{0}. However, for ARCH and GARCH models the set of equations S(\hat{\theta}_{mle}|\mathbf{x})=0 are complicated nonlinear functions of the elements of \theta and no analytic solutions exist. As a result, numerical optimization methods are required to determine \hat{\theta}_{mle}.
In the CER model with likelihood function (10.18), the log-likelihood function is \begin{equation} \ln L(\theta|\mathbf{r})=-\frac{T}{2}\ln(2\pi)-\frac{T}{2}\ln(\sigma^{2})-\frac{1}{2\sigma^{2}}\sum_{t=1}^{T}(r_{t}-\mu)^{2},\tag{10.25} \end{equation} and we may determine the MLE for \theta=(\mu,\sigma^{2})^{\prime} by maximizing the log-likelihood (10.25). The sample score is a (2\times1) vector given by \[ S(\theta|\mathbf{r})=\left(\begin{array}{c} \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\mu}\\ \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\sigma^{2}} \end{array}\right), \] where \begin{align*} \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\mu} & =\frac{1}{\sigma^{2}}\sum_{t=1}^{T}(r_{t}-\mu),\\ \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\sigma^{2}} & =-\frac{T}{2}(\sigma^{2})^{-1}+\frac{1}{2}(\sigma^{2})^{-2}\sum_{t=1}^{T}(r_{t}-\mu)^{2}. \end{align*}
Solving S(\hat{\theta}_{mle}|\mathbf{r})=\mathbf{0} gives the first order conditions \begin{align*} \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{r})}{\partial\mu} & =\frac{1}{\hat{\sigma}_{mle}^{2}}\sum_{t=1}^{T}(r_{t}-\hat{\mu}_{mle})=0,\\ \frac{\partial\ln L(\hat{\theta}_{mle}|\mathbf{r})}{\partial\sigma^{2}} & =-\frac{T}{2}(\hat{\sigma}_{mle}^{2})^{-1}+\frac{1}{2}(\hat{\sigma}_{mle}^{2})^{-2}\sum_{t=1}^{T}(r_{t}-\hat{\mu}_{mle})^{2}=0. \end{align*} Solving the first equation for \hat{\mu}_{mle} gives \begin{equation} \hat{\mu}_{mle}=\frac{1}{T}\sum_{t=1}^{T}r_{t}=\bar{r}.\tag{10.26} \end{equation} Hence, the sample average is the MLE for \mu. Using \hat{\mu}_{mle}=\bar{r} and solving the second equation for \hat{\sigma}_{mle}^{2} gives \begin{equation} \hat{\sigma}_{mle}^{2}=\frac{1}{T}\sum_{t=1}^{T}(r_{t}-\bar{r})^{2}.\tag{10.27} \end{equation} Notice that \hat{\sigma}_{mle}^{2} is not equal to the sample variance, which uses a divisor of T-1 instead of T.
It is instructive to compare the MLEs for \mu and \sigma^{2} with the CER model estimates presented in Chapter 7. Recall, the CER model estimators for \mu and \sigma^{2} are motivated by the plug-in principle and are equal to the sample mean and variance, respectively. Here, we see that the MLE for \mu is equal to the sample mean and the MLE for \sigma^{2} is (T-1)/T times the sample variance. In fact, it can be shown that the CER model estimates for the other parameters are also equal to or approximately equal to the MLEs.
\blacksquare
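The closed-form MLEs (10.26) and (10.27) are immediate to compute. The short Python check below, with made-up return values, also verifies the stated relation to the sample variance:

```python
import numpy as np

r = np.array([0.01, -0.02, 0.005, 0.015, -0.01])  # hypothetical returns
T = len(r)

mu_mle = np.mean(r)                         # (10.26): the sample average
sigma2_mle = np.sum((r - mu_mle) ** 2) / T  # (10.27): divisor T, not T-1

# The MLE equals (T-1)/T times the usual sample variance (divisor T-1)
print(np.isclose(sigma2_mle, np.var(r, ddof=1) * (T - 1) / T))
```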
In the GARCH(1,1) model with likelihood function (10.23) and parameter vector \theta=(\mu,\omega,\alpha_{1},\beta_{1})^{\prime}, the log-likelihood function is \begin{equation} \ln L(\theta|\mathbf{r})=-\frac{T}{2}\ln(2\pi)-\frac{1}{2}\sum_{t=1}^{T}\ln(\sigma_{t}^{2}(\theta))-\frac{1}{2}\sum_{t=1}^{T}\frac{(r_{t}-\mu)^{2}}{\sigma_{t}^{2}(\theta)}.\tag{10.28} \end{equation} In (10.23), it is instructive to write \sigma_{t}^{2}=\sigma_{t}^{2}(\theta) to emphasize that \sigma_{t}^{2} is a function of \theta. That is, from (10.15), \[ \sigma_{t}^{2}=\sigma_{t}^{2}(\theta)=\omega+\alpha_{1}\epsilon_{t-1}^{2}+\beta_{1}\sigma_{t-1}^{2}(\theta)=\omega+\alpha_{1}(r_{t-1}-\mu)^{2}+\beta_{1}\sigma_{t-1}^{2}(\theta). \] The sample score is a (4\times1) vector given by \[ S(\theta|\mathbf{r})=\left(\begin{array}{c} \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\mu}\\ \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\omega}\\ \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\alpha_{1}}\\ \frac{\partial\ln L(\theta|\mathbf{r})}{\partial\beta_{1}} \end{array}\right). \] The elements of S(\theta|\mathbf{r}), unfortunately, do not have simple closed form expressions, and no analytic formulas are available for the MLEs of the elements of \theta. As a result, numerical methods, as discussed in sub-section 10.3.6 below, are required to determine the \hat{\theta}_{mle} that maximizes (10.28) and solves S(\hat{\theta}_{mle}|\mathbf{r})=\mathbf{0}.
\blacksquare
10.3.3 Invariance Property of Maximum Likelihood Estimators
One of the attractive features of the method of maximum likelihood is its invariance to one-to-one transformations of the parameters of the log-likelihood. That is, if \hat{\theta}_{mle} is the MLE of \theta and \alpha=h(\theta) is a one-to-one function of \theta then \hat{\alpha}_{mle}=h(\hat{\theta}_{mle}) is the MLE for \alpha.
In the CER model, the log-likelihood is parametrized in terms of \mu and \sigma^{2} and we have the MLEs (10.26) and (10.27). Suppose we are interested in the MLE for \sigma=h(\sigma^{2})=(\sigma^{2})^{1/2}, which is a one-to-one function for \sigma^{2}>0. The invariance property of MLEs gives the result \[ \hat{\sigma}_{mle}=(\hat{\sigma}_{mle}^{2})^{1/2}=\left(\frac{1}{T}\sum_{t=1}^{T}(r_{t}-\hat{\mu}_{mle})^{2}\right)^{1/2}. \]
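The invariance property can also be checked numerically: maximizing the CER log-likelihood reparametrized in (\mu,\sigma) gives the same \hat{\sigma}_{mle} as taking the square root of (10.27). The sketch below assumes SciPy is available; the data and starting values are arbitrary illustrations.

```python
import numpy as np
from scipy.optimize import minimize

r = np.array([0.01, -0.02, 0.005, 0.015, -0.01])  # hypothetical returns

def neg_loglik(theta, r):
    """Negative CER log-likelihood parametrized in (mu, sigma), not (mu, sigma^2)."""
    mu, sigma = theta
    return (0.5 * len(r) * np.log(2 * np.pi * sigma ** 2)
            + 0.5 * np.sum((r - mu) ** 2) / sigma ** 2)

res = minimize(neg_loglik, x0=[0.0, 0.02], args=(r,), method="Nelder-Mead")
sigma_hat = res.x[1]

# Invariance: the MLE of sigma is the square root of the MLE of sigma^2
sigma2_mle = np.mean((r - np.mean(r)) ** 2)
print(np.isclose(sigma_hat, np.sqrt(sigma2_mle), atol=1e-3))
```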
10.3.4 The Precision of the Maximum Likelihood Estimator
The likelihood, log-likelihood and score functions for a typical model are illustrated in figure xxx. The likelihood function is always positive (since it is the joint density of the sample) but the log-likelihood function is typically negative (being the log of a number less than 1). Here the log-likelihood is globally concave and has a unique maximum at \hat{\theta}_{mle}. Consequently, the score function is positive to the left of the maximum, crosses zero at the maximum and becomes negative to the right of the maximum.
Intuitively, the precision of \hat{\theta}_{mle} depends on the curvature of the log-likelihood function near \hat{\theta}_{mle}. If the log-likelihood is very curved or “steep” around \hat{\theta}_{mle}, then \theta will be precisely estimated. In this case, we say that we have a lot of information about \theta. On the other hand, if the log-likelihood is not curved or “flat” near \hat{\theta}_{mle}, then \theta will not be precisely estimated. Accordingly, we say that we do not have much information about \theta.
The extreme case of a completely flat likelihood in \theta is illustrated in figure xxx. Here, the sample contains no information about the true value of \theta because every value of \theta produces the same value of the likelihood function. When this happens we say that \theta is not identified. Formally, \theta is identified if for all \theta_{1}\neq\theta_{2} there exists a sample \mathbf{x} for which L(\theta_{1}|\mathbf{x})\neq L(\theta_{2}|\mathbf{x}).
The curvature of the log-likelihood is measured by its second derivative (Hessian) H(\theta|\mathbf{x})=\frac{\partial^{2}\ln L(\theta|\mathbf{x})}{\partial\theta\partial\theta^{\prime}}. Since the Hessian is negative semi-definite, the information in the sample about \theta may be measured by -H(\theta|\mathbf{x}). If \theta is a scalar then -H(\theta|\mathbf{x}) is a positive number. The expected amount of information in the sample about the parameter \theta is the information matrix I(\theta|\mathbf{x})=-E[H(\theta|\mathbf{x})]. As we shall see in the next sub-section, the information matrix is directly related to the precision of the MLE.
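To illustrate, the Hessian and the implied standard errors can be approximated by finite differences of the log-likelihood. The Python sketch below does this for the CER model on simulated returns; the step size, seed, and simulation settings are arbitrary choices, not part of the text.

```python
import numpy as np

def loglik(theta, r):
    """CER log-likelihood (10.25)."""
    mu, sigma2 = theta
    return -0.5 * len(r) * np.log(2 * np.pi * sigma2) - 0.5 * np.sum((r - mu) ** 2) / sigma2

def numerical_hessian(f, theta, h=1e-5):
    """Central finite-difference approximation to the Hessian of f at theta."""
    k = len(theta)
    H = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            e_i, e_j = np.eye(k)[i] * h, np.eye(k)[j] * h
            H[i, j] = (f(theta + e_i + e_j) - f(theta + e_i - e_j)
                       - f(theta - e_i + e_j) + f(theta - e_i - e_j)) / (4 * h * h)
    return H

rng = np.random.default_rng(0)
r = rng.normal(0.001, 0.02, size=500)  # simulated CER returns
theta_mle = np.array([np.mean(r), np.mean((r - np.mean(r)) ** 2)])

H = numerical_hessian(lambda th: loglik(th, r), theta_mle)
se = np.sqrt(np.diag(np.linalg.inv(-H)))  # asymptotic standard errors
# se[0] should be close to sqrt(sigma2_hat / T), the familiar standard error of the mean
```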
10.3.5 Asymptotic Properties of Maximum Likelihood Estimators
Let \{X_{t}\} be a covariance stationary time series with conditional probability density function (pdf) f(x_{t}|I_{t-1};\theta), where \theta is a (k\times1) vector of parameters that characterize f(x_{t}|I_{t-1};\theta). Under suitable regularity conditions, the ML estimator of \theta has the following asymptotic properties as the sample size T\rightarrow\infty:
- \hat{\theta}_{mle}\overset{p}{\rightarrow}\theta
- \sqrt{T}(\hat{\theta}_{mle}-\theta)\overset{d}{\rightarrow}N(0,I(\theta|x_{t})^{-1}), where \[ I(\theta|x_{t})=-E\left[H(\theta|x_{t})\right]=-E\left[\frac{\partial^{2}\ln f(x_{t}|I_{t-1};\theta)}{\partial\theta\partial\theta^{\prime}}\right]. \] That is, \mathrm{avar}(\sqrt{T}(\hat{\theta}_{mle}-\theta))=I(\theta|x_{t})^{-1}. Alternatively, for large T, \[ \hat{\theta}_{mle}\sim N\left(\theta,\frac{1}{T}I(\theta|x_{t})^{-1}\right)=N(\theta,I(\theta|\mathbf{x})^{-1}), \] where I(\theta|\mathbf{x})=T\cdot I(\theta|x_{t}) is the information matrix for the full sample.
- \hat{\theta}_{mle} is efficient in the class of consistent and asymptotically normal estimators. That is, \mathrm{avar}(\sqrt{T}(\hat{\theta}_{mle}-\theta))-\mathrm{avar}(\sqrt{T}(\tilde{\theta}-\theta))\leq0 (in the matrix sense) for any consistent and asymptotically normal estimator \tilde{\theta}.
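A small Monte Carlo experiment (a sketch; the parameter values, sample size, and number of replications are arbitrary) illustrates the asymptotic normality result for \hat{\mu}_{mle} in the CER model: across replications, the mean of the estimates should be near \mu and their standard deviation near \sigma/\sqrt{T}.

```python
import numpy as np

# Monte Carlo check: across replications, mu_hat is approximately N(mu, sigma^2/T)
rng = np.random.default_rng(1)
mu, sigma, T, n_sim = 0.001, 0.02, 250, 2000

mu_hats = np.array([rng.normal(mu, sigma, T).mean() for _ in range(n_sim)])

print(mu_hats.mean())  # close to mu = 0.001
print(mu_hats.std())   # close to sigma / sqrt(T) = 0.02 / sqrt(250)
```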
10.3.6 Computing the MLE Using Numerical Optimization Methods
Computation: Newton-Raphson Iteration
Goal: use an iterative scheme to compute \[ \hat{\theta}=\arg\max_{\theta}\,\ln L(\theta|\mathbf{x}). \] Numerical optimization methods generally have the following structure:
- Determine a set of starting values
- Update the starting values in a direction that increases \ln L(\theta|\mathbf{x})
- Check to see if the update satisfies the FOCs. If not, update again.
- Stop updating when the FOCs are satisfied.
The most common method for numerically maximizing the log-likelihood is Newton-Raphson iteration. This method can be motivated as follows. Consider a second order Taylor series expansion of \ln L(\theta|\mathbf{x}) about the starting value \hat{\theta}_{1}: \begin{align} \ln L(\theta|\mathbf{x}) & =\ln L(\hat{\theta}_{1}|\mathbf{x})+\frac{\partial\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta^{\prime}}(\theta-\hat{\theta}_{1})\tag{10.29}\\ & +\frac{1}{2}(\theta-\hat{\theta}_{1})^{\prime}\frac{\partial^{2}\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta\partial\theta^{\prime}}(\theta-\hat{\theta}_{1})+error.\nonumber \end{align} Now maximize (10.29) with respect to \theta. The FOCs are \[ \underset{k\times1}{\mathbf{0}}=\frac{\partial\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta}+\frac{\partial^{2}\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta\partial\theta^{\prime}}(\hat{\theta}_{2}-\hat{\theta}_{1}),
\]
which can be solved for \hat{\theta}_{2}: \begin{align*} \hat{\theta}_{2} & =\hat{\theta}_{1}-\left[\frac{\partial^{2}\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta\partial\theta^{\prime}}\right]^{-1}\frac{\partial\ln L(\hat{\theta}_{1}|\mathbf{x})}{\partial\theta}\\ & =\hat{\theta}_{1}-H(\hat{\theta}_{1}|\mathbf{x})^{-1}S(\hat{\theta}_{1}|\mathbf{x}) \end{align*} This suggests the iterative scheme \hat{\theta}_{n+1}=\hat{\theta}_{n}-H(\hat{\theta}_{n}|\mathbf{x})^{-1}S(\hat{\theta}_{n}|\mathbf{x}) Iteration stops when S(\hat{\theta}_{n}|\mathbf{x})\approx0.
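For the CER model the score and Hessian have closed forms, so the iterative scheme above can be written out directly. The Python sketch below (simulated data and starting values are illustrative) converges to the closed-form MLEs (10.26)-(10.27) in a few iterations; note that Newton-Raphson can diverge from poor starting values, which is why the choice of starting values in the first step matters.

```python
import numpy as np

def score(theta, r):
    """Score vector S(theta|r) for the CER model."""
    mu, s2 = theta
    T = len(r)
    return np.array([np.sum(r - mu) / s2,
                     -T / (2 * s2) + np.sum((r - mu) ** 2) / (2 * s2 ** 2)])

def hessian(theta, r):
    """Hessian H(theta|r) of the CER log-likelihood."""
    mu, s2 = theta
    T = len(r)
    return np.array([
        [-T / s2,                   -np.sum(r - mu) / s2 ** 2],
        [-np.sum(r - mu) / s2 ** 2, T / (2 * s2 ** 2) - np.sum((r - mu) ** 2) / s2 ** 3],
    ])

def newton_raphson(r, theta0, tol=1e-8, max_iter=100):
    """Iterate theta <- theta - H^{-1} S until the score is approximately zero."""
    theta = np.array(theta0, dtype=float)
    for _ in range(max_iter):
        S = score(theta, r)
        if np.max(np.abs(S)) < tol:
            break
        theta = theta - np.linalg.solve(hessian(theta, r), S)
    return theta

rng = np.random.default_rng(2)
r = rng.normal(0.001, 0.02, 500)                 # simulated CER returns
theta_hat = newton_raphson(r, [0.0, np.var(r)])  # moment-based starting values

# theta_hat matches the closed-form MLEs (r-bar, mean((r - r-bar)^2))
print(np.allclose(theta_hat, [np.mean(r), np.mean((r - np.mean(r)) ** 2)]))
```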
- Practical considerations: a stopping criterion is needed, e.g., stop when the elements of S(\hat{\theta}_{n}|\mathbf{x}), or the change \hat{\theta}_{n+1}-\hat{\theta}_{n}, are smaller than some tolerance.
- Practical considerations: the derivatives may be computed analytically or numerically (by finite differences).
- Practical considerations: Newton-Raphson gives H(\hat{\theta}_{n}|\mathbf{x})^{-1} as a by-product of the optimization, and this is used to compute the asymptotic standard errors of the MLE.
If X_{1},\ldots,X_{T} are discrete random variables, then f(x_{1},\ldots,x_{T};\theta)=\Pr(X_{1}=x_{1},\ldots,X_{T}=x_{T}) for a fixed value of \theta.
Standard regularity conditions are: (1) the support of the random variables X,\,S_{X}=\{x:f(x;\theta)>0\}, does not depend on \theta; (2) f(x;\theta) is at least three times differentiable with respect to \theta; (3) the true value of \theta lies in a compact set \Theta.